[00:09:35] (03CR) 10CRusnov: "For what it's worth, I have tested this on af-netbox, with some intentional failures to make sure the various avenues work." [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [00:12:26] 10Operations, 10netops: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (10ayounsi) a:03ayounsi ` Apr 16 23:20:49 cr4-ulsfo kernel: spin lock 0xfffff80012ce73c0 (turnstile lock) held by 0xfffff8000941d560 (tid 100012) too long Apr 16 23:20:49 cr4-ulsfo kernel: panic: spin lock h... [00:12:38] 10Operations, 10netops: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (10ayounsi) p:05Triage→03Normal [00:40:37] (03PS16) 10CRusnov: Port MakeVM to a cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) [00:40:45] (03CR) 10CRusnov: Port MakeVM to a cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [01:25:14] 10Operations, 10Phabricator, 10serviceops: Puppet using the phabricator class fails with: "secret(): invalid secret waf/modsecurity_admin.conf" - https://phabricator.wikimedia.org/T221182 (10Paladox) [01:32:35] 10Operations, 10Phabricator, 10serviceops: Puppet using the phabricator class fails with: "secret(): invalid secret waf/modsecurity_admin.conf" - https://phabricator.wikimedia.org/T221182 (10Paladox) 05Open→03Declined Oh, nvm, labs/private was out of date too. [02:12:07] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:39:01] (03PS3) 10Alex Monk: [DNM] Experiment: Try removing file that may not actually be used [puppet] - 10https://gerrit.wikimedia.org/r/504440 [02:39:31] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Experiment: Try removing file that may not actually be used [puppet] - 10https://gerrit.wikimedia.org/r/504440 (owner: 10Alex Monk) [02:40:01] (03PS4) 10Alex Monk: [DNM] Experiment: Try removing file that may not actually be used [puppet] - 10https://gerrit.wikimedia.org/r/504440 [02:40:03] (03CR) 10Alex Monk: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/504440 (owner: 10Alex Monk) [02:40:21] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Experiment: Try removing file that may not actually be used [puppet] - 10https://gerrit.wikimedia.org/r/504440 (owner: 10Alex Monk) [02:43:55] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [02:44:54] (03PS5) 10Alex Monk: [DNM] Experiment: Try removing file that may not actually be used [puppet] - 10https://gerrit.wikimedia.org/r/504440 [02:45:02] (03CR) 10Alex Monk: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/504440 (owner: 10Alex Monk) [02:45:22] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Experiment: Try removing file that may not actually be used [puppet] - 10https://gerrit.wikimedia.org/r/504440 (owner: 10Alex Monk) [02:46:07] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Experiment: Try removing file that may not actually be used [puppet] - 10https://gerrit.wikimedia.org/r/504440 (owner: 10Alex Monk) [02:51:18] (03PS6) 10Alex Monk: [DNM] Experiment: Try removing file that may not actually be used [puppet] - 10https://gerrit.wikimedia.org/r/504440 [02:51:26] (03CR) 10Alex Monk: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/504440 (owner: 10Alex Monk) [02:51:46] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Experiment: Try removing file that may not actually be used [puppet] - 
10https://gerrit.wikimedia.org/r/504440 (owner: 10Alex Monk) [02:52:42] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Experiment: Try removing file that may not actually be used [puppet] - 10https://gerrit.wikimedia.org/r/504440 (owner: 10Alex Monk) [02:53:53] (03Abandoned) 10Alex Monk: [DNM] Experiment: Try removing file that may not actually be used [puppet] - 10https://gerrit.wikimedia.org/r/504440 (owner: 10Alex Monk) [03:11:17] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Andrew) [03:11:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1024 with 10G interfaces - https://phabricator.wikimedia.org/T216724 (10Andrew) [03:11:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1024 with 10G interfaces - https://phabricator.wikimedia.org/T216724 (10Andrew) Imaging is blocked by https://phabricator.wikimedia.org/T209707 [03:46:45] (03PS1) 10Andrew Bogott: cloud: add new cloud-ns and cloud-recursor names to cloudservices hosts [dns] - 10https://gerrit.wikimedia.org/r/504490 (https://phabricator.wikimedia.org/T221183) [03:49:43] PROBLEM - puppet last run on db1115 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:57:12] (03PS2) 10Andrew Bogott: cloud: add new cloud-ns and cloud-recursor names to cloudservices hosts [dns] - 10https://gerrit.wikimedia.org/r/504490 (https://phabricator.wikimedia.org/T221183) [04:05:11] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:06:20] (03CR) 10CDanis: [C: 03+1] "Makes sense to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/504482 (https://phabricator.wikimedia.org/T220887) (owner: 10Dzahn) [04:16:11] RECOVERY - puppet last run on db1115 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [04:31:39] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:31:41] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 27111 MB (5% inode=99%) [04:49:57] RECOVERY - Disk space on elastic1017 is OK: DISK OK [05:01:27] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:04:31] PROBLEM - puppet last run on mw1346 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:20:13] PROBLEM - puppet last run on wtp1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:27:55] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [05:30:59] RECOVERY - puppet last run on mw1346 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [05:32:43] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:34:31] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 421.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [05:46:41] RECOVERY - puppet last run on wtp1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:59:13] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [06:02:51] (03Abandoned) 10Vgutierrez: Add SPF record for toolserver.org [dns] - 10https://gerrit.wikimedia.org/r/504243 (https://phabricator.wikimedia.org/T220786) (owner: 10Vgutierrez) [06:30:29] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/conf.d/00_main] [06:32:07] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/phaste] [06:47:33] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 51.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [06:56:53] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:31] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:07:47] !log rolling reboots of Swift backends in codfw for combined kernel/glibc/OpenSSL update [07:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:42] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool db1078 for hardware maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504392 (owner: 10Jcrespo) [07:17:41] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1078 for hardware maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504392 (owner: 10Jcrespo) [07:17:54] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1078 for hardware maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504392 (owner: 10Jcrespo) [07:20:46] (03PS1) 10Gilles: Enable Event Timing origin trial on ruwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504501 (https://phabricator.wikimedia.org/T216597) [07:21:31] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 with low load (duration: 01m 18s) [07:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:13] (03PS1) 10Urbanecm: Prepare initial configuration for iniciativeswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) [07:27:57] (03CR) 10jerkins-bot: [V: 04-1] Prepare initial configuration for iniciativeswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) (owner: 
10Urbanecm) [07:29:02] (03PS1) 10Urbanecm: Add DNS entries for iniciativeswiki [dns] - 10https://gerrit.wikimedia.org/r/504503 [07:29:29] (03PS2) 10Urbanecm: Add DNS entries for iniciativeswiki [dns] - 10https://gerrit.wikimedia.org/r/504503 (https://phabricator.wikimedia.org/T167375) [07:39:25] (03CR) 10Gilles: [C: 03+2] Enable Event Timing origin trial on ruwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504501 (https://phabricator.wikimedia.org/T216597) (owner: 10Gilles) [07:40:33] (03Merged) 10jenkins-bot: Enable Event Timing origin trial on ruwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504501 (https://phabricator.wikimedia.org/T216597) (owner: 10Gilles) [07:40:47] (03CR) 10jenkins-bot: Enable Event Timing origin trial on ruwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504501 (https://phabricator.wikimedia.org/T216597) (owner: 10Gilles) [07:45:43] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T216597 Enable Event Timing origin trial on ruwiki and eswiki (duration: 01m 04s) [07:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:48] T216597: Event Timing origin trial - https://phabricator.wikimedia.org/T216597 [07:51:53] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [07:51:57] !log gilles@deploy1001 Synchronized php-1.33.0-wmf.25/extensions/NavigationTiming: T216597 Event timing support (duration: 01m 01s) [07:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:01] T216597: Event Timing origin trial - https://phabricator.wikimedia.org/T216597 [08:01:02] that is strange, but most likely due to backups, will disable alerts [08:01:29] (03PS2) 10Muehlenhoff: Assign role for kerberos1001 [puppet] - 10https://gerrit.wikimedia.org/r/504352 [08:10:03] 
(03PS1) 10Vgutierrez: config: Move ACMEChiefConfig to its own module [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504510 (https://phabricator.wikimedia.org/T220518) [08:10:06] (03PS1) 10Vgutierrez: dns: Move DNS operations to its own module [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504511 [08:10:08] (03PS1) 10Vgutierrez: acme_chief: Prevalidate CN/SNI list [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518) [08:11:35] (03CR) 10jerkins-bot: [V: 04-1] config: Move ACMEChiefConfig to its own module [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504510 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez) [08:11:41] (03CR) 10jerkins-bot: [V: 04-1] dns: Move DNS operations to its own module [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504511 (owner: 10Vgutierrez) [08:11:45] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Prevalidate CN/SNI list [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez) [08:12:41] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 22.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:13:20] (03PS2) 10Vgutierrez: config: Move ACMEChiefConfig to its own module [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504510 (https://phabricator.wikimedia.org/T220518) [08:13:22] (03PS2) 10Vgutierrez: dns: Move DNS operations to its own module [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504511 [08:13:25] (03PS2) 10Vgutierrez: acme_chief: Prevalidate CN/SNI list [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518) [08:14:54] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Prevalidate CN/SNI list [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518) (owner: 
10Vgutierrez) [08:14:56] (03CR) 10jerkins-bot: [V: 04-1] dns: Move DNS operations to its own module [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504511 (owner: 10Vgutierrez) [08:15:33] damn flake8 :) [08:23:05] (03PS1) 10Jcrespo: mariadb: Repool db1078 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504513 (https://phabricator.wikimedia.org/T219115) [08:28:16] (03CR) 10Jcrespo: [C: 03+2] mariadb: Repool db1078 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504513 (https://phabricator.wikimedia.org/T219115) (owner: 10Jcrespo) [08:29:16] (03Merged) 10jenkins-bot: mariadb: Repool db1078 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504513 (https://phabricator.wikimedia.org/T219115) (owner: 10Jcrespo) [08:29:32] !log installing ghostscript security updates [08:29:33] (03PS3) 10Vgutierrez: dns: Move DNS operations to its own module [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504511 [08:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:36] (03PS3) 10Vgutierrez: acme_chief: Prevalidate CN/SNI list [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518) [08:31:36] (03PS1) 10Ema: vcl: remove If-Cached support [puppet] - 10https://gerrit.wikimedia.org/r/504514 (https://phabricator.wikimedia.org/T220510) [08:34:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/504514 (https://phabricator.wikimedia.org/T220510) (owner: 10Ema) [08:35:53] (03CR) 10jenkins-bot: mariadb: Repool db1078 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504513 (https://phabricator.wikimedia.org/T219115) (owner: 10Jcrespo) [08:38:03] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 fully (duration: 01m 00s) [08:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:01] thank you, logmsgbot 
[08:39:18] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS1299/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:40:14] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:40:36] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 175, down: 0, shutdown: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:42:35] 10Operations, 10Thumbor, 10serviceops: Export useful metrics from haproxy logs for Thumbor - https://phabricator.wikimedia.org/T220499 (10Gilles) a:05Gilles→03jijiki [08:42:50] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:43:16] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10aborrero) [08:45:02] 10Operations, 10Thumbor, 10serviceops: Export useful metrics from haproxy logs for Thumbor - https://phabricator.wikimedia.org/T220499 (10Gilles) I'm not seeing the metrics show up in the "eqiad/prometheus" ops datasource in Grafana. I'm not sure how prometheus is supposed to be configured to collect the dat... 
[08:47:06] !log reimage prometheus1004 - T187987 [08:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:11] T187987: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 [08:56:24] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:58:32] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active, AS1299/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:59:24] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:03:49] (03CR) 10Muehlenhoff: [C: 03+2] Assign role for kerberos1001 [puppet] - 10https://gerrit.wikimedia.org/r/504352 (owner: 10Muehlenhoff) [09:08:30] (03CR) 10DCausse: [C: 03+1] profile::elasticsearch::cirrus: Don't duplicate udev stuff [puppet] - 10https://gerrit.wikimedia.org/r/503709 (owner: 10Alex Monk) [09:09:33] !log restart eventlogging on eventlog1002 due to errors in processors and consumer lag accumulated after the last Kafka Jumbo roll restart [09:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:43] (03PS1) 10Ema: cache_upload: add cc_command to VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/504515 [09:10:48] (03CR) 10Ema: [C: 03+2] cache_upload: add cc_command to VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/504515 (owner: 10Ema) [09:11:32] moritzm: \o/ [09:12:45] (03PS1) 10Muehlenhoff: Fix typo in Hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/504516 [09:12:50] (03PS2) 10Ema: vcl: remove If-Cached support [puppet] - 10https://gerrit.wikimedia.org/r/504514 (https://phabricator.wikimedia.org/T220510) [09:13:34] (03CR) 10Ema: [C: 03+2] vcl: remove If-Cached support [puppet] - 
10https://gerrit.wikimedia.org/r/504514 (https://phabricator.wikimedia.org/T220510) (owner: 10Ema) [09:13:52] (03PS2) 10Muehlenhoff: Fix typo in Hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/504516 [09:14:33] (03CR) 10Muehlenhoff: [C: 03+2] Fix typo in Hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/504516 (owner: 10Muehlenhoff) [09:17:08] !log swift eqiad-prod continue ms-be1013 decom - T220590 [09:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:13] T220590: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 [09:20:35] elukey: we have a running KDC, I'll puppetise the config next [09:21:37] * elukey dances [09:22:05] a rare reaction to someone mentioning Kerberos :-) [09:23:59] moritzm: with the notable exception of Hades [09:24:36] <_joe_> moritzm: indeed. [09:29:24] fyi Looks like the telia link in ashburn is down, so folks are starting to report they can't reach wmf servers. https://lg.telia.net/?type=ping&router=ash-b1&address=208.80.153.224 [09:29:27] https://downdetector.com/status/wikipedia [09:33:41] hmm [09:33:46] let's see what we can do about that [09:37:31] RECOVERY - Long running screen/tmux on prometheus1004 is OK: OK: SCREEN detected but not long running. [09:37:57] (03PS1) 10Muehlenhoff: Align profile::kerberos::client with new profiles [puppet] - 10https://gerrit.wikimedia.org/r/504518 [09:38:15] PROBLEM - puppet last run on ms-be2014 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 13 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Exec[mkfs-/dev/sdl1],Exec[xfs_label-/dev/sdm3],Exec[xfs_label-/dev/sdm4] [09:38:42] known, I'll downtime it [09:38:43] ^ this is known, server has broken disk and will be decommed, silencing [09:38:54] ah, go ahead :-) [09:39:41] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 175, down: 2, shutdown: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:40:20] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, nit inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504360 (https://phabricator.wikimedia.org/T221099) (owner: 10Fsero) [09:42:52] AlexZ: telia had a maintenance window. Their looking glass was pretty much expected to report an error [09:44:29] I was the one who told AlexZ and I have full recovery on my end [09:45:06] yeah the maintenance window on telia's side seems to have ended (we haven't received confirmation of that yet, so it might reoccur) [09:45:48] ah figures, yeah I'm on the west coast US.. so had to do some poking since it looked fine for me. [09:45:54] Izhidez: for how long were you unable to access the sites? [09:46:42] even with that maint window, there should not have been an issue for long, so definitely worth investigating [09:47:45] akosiaris: confirmed just before 0500 local and I just switched to a proxy at 0516 local (all EST) [09:48:01] so I don't know how far past that it went [09:48:49] Seems like it resolved in the past 10 min from what I can see. [09:56:06] (03PS17) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [09:57:22] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM as well. Would you want to schedule a time for deploying this change? I 'll be happy to deploy it with you." 
[puppet] - 10https://gerrit.wikimedia.org/r/469262 (owner: 10Fomafix) [09:57:33] (03PS28) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [09:57:35] (03PS18) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [09:59:01] AlexZ: Izhidez thanks for reporting, please let us know if it reoccurs later. [10:00:24] fsero: i'm off to bed, so I won't be of much use for that sadly [10:00:57] (03PS19) 10Ema: cache: implement profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [10:01:16] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (10jcrespo) 05Open→03Resolved a:05jcrespo→03Cmjohnson Repooled and fixed. [10:04:03] (03PS2) 10Muehlenhoff: Align profile::kerberos::client with new profiles [puppet] - 10https://gerrit.wikimedia.org/r/504518 [10:04:54] (03CR) 10Ema: "updated pcc here: https://puppet-compiler.wmflabs.org/compiler1002/15851/" [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [10:06:31] (03CR) 10Muehlenhoff: [C: 03+2] Align profile::kerberos::client with new profiles [puppet] - 10https://gerrit.wikimedia.org/r/504518 (owner: 10Muehlenhoff) [10:07:44] (03PS2) 10Fsero: registryha: added prometheus scraping new registry [puppet] - 10https://gerrit.wikimedia.org/r/504360 (https://phabricator.wikimedia.org/T221099) [10:08:51] 10Operations, 10Operations-Software-Development: spicerack/cookbook: add additional arguments IRC/SAL logging - https://phabricator.wikimedia.org/T221212 (10aborrero) [10:12:29] (03PS3) 10Fsero: registryha: added prometheus scraping new registry [puppet] - 10https://gerrit.wikimedia.org/r/504360 
(https://phabricator.wikimedia.org/T221099) [10:12:43] 10Operations, 10Operations-Software-Development: spicerack/cookbook: add additional arguments IRC/SAL logging - https://phabricator.wikimedia.org/T221212 (10aborrero) I would like to propose these 2 totally untested patches: `name=cookbooks,lines=10 commit 05c43faf249e015e7f91191e77f965b329d3377f Author: Artu... [10:13:01] (03CR) 10Fsero: [C: 03+2] "i forgot to include the task :)" [puppet] - 10https://gerrit.wikimedia.org/r/504360 (https://phabricator.wikimedia.org/T221099) (owner: 10Fsero) [10:13:53] 10Operations, 10Operations-Software-Development: spicerack/cookbook: add additional arguments IRC/SAL logging - https://phabricator.wikimedia.org/T221212 (10Volans) p:05Triage→03Normal Yes indeed, that's already in the plan for improvements. The problem here is that the logging is done by the framework and... [10:14:29] (03PS4) 10Fsero: registryha: added prometheus scraping new registry [puppet] - 10https://gerrit.wikimedia.org/r/504360 (https://phabricator.wikimedia.org/T221099) [10:15:01] (03PS1) 10Muehlenhoff: Add profile::kerberos::client to KDC role [puppet] - 10https://gerrit.wikimedia.org/r/504519 [10:19:23] (03CR) 10Muehlenhoff: [C: 03+2] Add profile::kerberos::client to KDC role [puppet] - 10https://gerrit.wikimedia.org/r/504519 (owner: 10Muehlenhoff) [10:20:42] (03CR) 10Fsero: [C: 03+2] registryha: added prometheus scraping new registry [puppet] - 10https://gerrit.wikimedia.org/r/504360 (https://phabricator.wikimedia.org/T221099) (owner: 10Fsero) [10:20:59] (03PS5) 10Fsero: registryha: added prometheus scraping new registry [puppet] - 10https://gerrit.wikimedia.org/r/504360 (https://phabricator.wikimedia.org/T221099) [10:22:00] (03CR) 10Volans: [C: 03+1] "LGTM, one nit inline" (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [10:22:45] (03PS2) 10Urbanecm: Prepare initial configuration for iniciativeswiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) [10:24:55] 10Operations, 10Traffic: Allow running several ATS instances in the same server - https://phabricator.wikimedia.org/T221217 (10Vgutierrez) [10:25:19] 10Operations, 10Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (10Vgutierrez) [10:25:22] 10Operations, 10Traffic: Allow running several ATS instances in the same server - https://phabricator.wikimedia.org/T221217 (10Vgutierrez) [10:28:18] (03CR) 10Jbond: "overall looks like a goo improvement to me" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 (owner: 10Volans) [10:29:38] (03CR) 10Volans: flake8: enforce import order and adopt W504 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 (owner: 10Volans) [10:34:16] (03CR) 10Jbond: flake8: enforce import order and adopt W504 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 (owner: 10Volans) [10:40:23] !log installing Java security updates on kafka/analytics cluster [10:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:20] RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [10:46:41] (03CR) 10Volans: [V: 03+2 C: 03+2] Updated src to v0.1.9 and rebuilt wheels [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/504013 (owner: 10Volans) [10:54:30] RECOVERY - EDAC syslog messages on thumbor1004 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [10:55:13] (03PS1) 10Fsero: registryha: open 5001 to prometheus nodes, removing nginx metrics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/504526 (https://phabricator.wikimedia.org/T221099) [10:56:02] 
(03CR) 10jerkins-bot: [V: 04-1] registryha: open 5001 to prometheus nodes, removing nginx metrics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/504526 (https://phabricator.wikimedia.org/T221099) (owner: 10Fsero) [10:57:15] !log volans@deploy1001 Started deploy [debmonitor/deploy@f049b3b]: Deploy Debmonitor v0.1.9 [10:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:35] (03PS2) 10Fsero: registryha: open 5001 to prometheus nodes, removing nginx metrics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/504526 (https://phabricator.wikimedia.org/T221099) [10:57:47] I have a change scheduled for SWAT in a few minutes [10:58:04] but unfortunately there’s another event in the office for around half an hour [10:58:16] !log volans@deploy1001 Finished deploy [debmonitor/deploy@f049b3b]: Deploy Debmonitor v0.1.9 (duration: 01m 00s) [10:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:19] is it okay if I do my SWAT change on :30 instead of :00? [10:58:20] (03CR) 10jerkins-bot: [V: 04-1] registryha: open 5001 to prometheus nodes, removing nginx metrics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/504526 (https://phabricator.wikimedia.org/T221099) (owner: 10Fsero) [10:58:23] (13:30 EU time) [10:59:12] (03PS3) 10Fsero: registryha: open 5001 to prometheus nodes, removing nginx metrics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/504526 (https://phabricator.wikimedia.org/T221099) [10:59:18] Lucas_WMDE: Still in the window? [10:59:24] yes [10:59:26] second half [10:59:38] Should be fine, yeah [10:59:42] ok thanks [10:59:44] see you then [10:59:47] Doesn't have to start on the hour [10:59:51] I'm planning to steal the window after that ^ [11:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190417T1100). [11:00:04] Lucas_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] (03CR) 10Fsero: "PCC is also happy https://puppet-compiler.wmflabs.org/compiler1002/15853/" [puppet] - 10https://gerrit.wikimedia.org/r/504526 (https://phabricator.wikimedia.org/T221099) (owner: 10Fsero) [11:00:20] (03CR) 10Fsero: [C: 03+2] registryha: open 5001 to prometheus nodes, removing nginx metrics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/504526 (https://phabricator.wikimedia.org/T221099) (owner: 10Fsero) [11:02:04] (03PS1) 10Mathew.onipe: prometheus: add maps cassandra job [puppet] - 10https://gerrit.wikimedia.org/r/504529 (https://phabricator.wikimedia.org/T221055) [11:08:26] (03CR) 10Filippo Giunchedi: "+Eric, since this is the 'services' Prometheus instance. 
I'm ok with maps cassandra living here or in the 'ops' instance" [puppet] - 10https://gerrit.wikimedia.org/r/504529 (https://phabricator.wikimedia.org/T221055) (owner: 10Mathew.onipe) [11:19:12] o/ [11:19:14] I’m here now [11:19:18] starting with my patch [11:20:06] (03PS3) 10Lucas Werkmeister (WMDE): Enable suggestion constraint status on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504380 (https://phabricator.wikimedia.org/T221108) [11:20:12] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504380 (https://phabricator.wikimedia.org/T221108) (owner: 10Lucas Werkmeister (WMDE)) [11:21:20] (03Merged) 10jenkins-bot: Enable suggestion constraint status on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504380 (https://phabricator.wikimedia.org/T221108) (owner: 10Lucas Werkmeister (WMDE)) [11:22:51] Lucas_WMDE: your change is on mwdebug1002, please test [11:24:51] hm, not seeing any effect yet… [11:26:00] is the job queue lagged? 
[11:26:43] reedy@deploy1001:~$ mwscript showJobs.php testwikidatawiki [11:26:43] 0 [11:26:46] Not for that wiki at least [11:26:54] weird [11:28:22] it still complains that the parameter “constraint status” must be “mandatory constraint” [11:28:30] even after I’ve *removed* that parameter from the constraint in question [11:29:00] PROBLEM - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [11:29:42] PROBLEM - Mediawiki Cirrussearch update lag - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [11:31:13] (03CR) 10jenkins-bot: Enable suggestion constraint status on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504380 (https://phabricator.wikimedia.org/T221108) (owner: 10Lucas Werkmeister (WMDE)) [11:33:13] okay, manually running the job from eval.php seems to have resolved that [11:33:26] so I guess showJobs.php doesn’t support the EventBus job queue, and its output is actually meaningless? [11:33:33] dcausse: ^ [11:33:44] That is possible [11:33:59] If so... We should fix that, or at least output something useful from runJobs.php [11:35:37] well, um [11:35:44] turns out I had enabled xdebug in my browser [11:35:48] instead of X-Wikimedia-Debug… [11:35:52] :shame: [11:35:59] anyways, my test works now, so deploying [11:37:50] haha [11:38:00] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:504380|Enable suggestion constraint status on testwikidata (T221108, T204439)]] (duration: 01m 01s) [11:38:29] okay, I’m done [11:38:33] Amir1: your turn? [11:38:51] Okay, I want to create hiwikisource [11:39:00] (where’s stashbot’s acknowledgement?) [11:39:02] Amir1: Is that a good idea? 
[11:39:08] Lucas_WMDE: Want to file a job about showJobs.php being useless in WMF prod? [11:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:15] T204439: new constraint level for suggestions - https://phabricator.wikimedia.org/T204439 [11:39:16] T221108: Configure suggestion constraint level on Test Wikidata - https://phabricator.wikimedia.org/T221108 [11:39:18] ah, there we go [11:39:22] Reedy: not sure anymore, 20 minutes left [11:39:24] Amir1: is that still part of SWAT or should I log that it’s done? [11:39:36] no it's not SWAT [11:39:37] Amir1: More, that I don't think we know if addWiki.php is fixed (nor if anyone has tried to fix it) [11:39:44] !log EU SWAT done [11:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:53] Reedy: well I’m not sure if it actually is useless [11:40:12] Reedy: yeah, the Aaron's patch went in but I can't say for sure that it fixes the issue. It happened again [11:40:12] reedy@deploy1001:~$ mwscript showJobs.php enwiki [11:40:12] 0 [11:40:12] reedy@deploy1001:~$ mwscript showJobs.php dewiki [11:40:12] 0 [11:40:13] well, okay, it also says 0 jobs for wikidatawiki which can’t possibly be true [11:40:15] Not sure I believe that [11:40:17] yeah [11:40:27] I’ll search for a task and open one if it doesn’t exist yet [11:40:35] thanks :) [11:41:19] Amir1: I see no patches (to addWiki.php) since the last deploy [11:41:50] Lucas_WMDE: https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/commit/822343ea1ff48fb25a24dd6a7155ef38564066f4 also lol [11:44:37] I mean the one before that [11:45:23] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/486406 [11:46:09] That'd landed already? [11:46:20] Or was that only after the first attempt? 
[11:47:48] I don't see how that'd help with the invalid references to ES that it wrote for the initial pages [11:48:29] task for showJobs.php: https://phabricator.wikimedia.org/T221224 (feel free to improve) [11:52:58] Reedy: That landed before attempt of hywwiki [11:53:11] so I'm not sure how to proceed here [11:53:17] Did a bug get filed for the blob links being wrong? [11:53:59] If not, it'd be a good idea and CC Aaron [11:55:15] yeah [11:55:19] I should make one [11:56:58] RECOVERY - Mediawiki Cirrussearch update lag - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [11:57:34] RECOVERY - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [12:00:39] er what invalid references? [12:00:44] Reedy: [12:01:06] apergos: Can't remember the specifics, but when hywwiki was created.. [12:01:14] The links pointing to ES were completely wrong [12:01:45] es = external store or elastic search? and if it's external store do you mean entries in the text table? [12:02:07] external store [12:02:19] Yeah, text table [12:02:33] and if it's bad entries in the text table were they cleaned up or are they still there? I can go look i guess but [12:02:35] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10aborrero) [12:02:39] if you happen to know... 
[12:03:12] I think Amir tried to fix some by hand [12:03:12] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/504309 (owner: 10Arturo Borrero Gonzalez) [12:03:22] But we ended up just creating new revisions and rev-del-ing the bad ones [12:04:23] ok thanks for the info [12:05:03] It was for just the "autocreated" pages as part of addwiki [12:05:07] Anything done by users after was fine [12:07:37] 10Operations, 10Puppet: Create canary roles for all canaries - https://phabricator.wikimedia.org/T221226 (10jbond) p:05Triage→03Low [12:08:26] PROBLEM - puppet last run on cp1076 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:08:53] (03PS1) 10Jbond: Canary roles: create a canary role for the aqs canary server [puppet] - 10https://gerrit.wikimedia.org/r/504538 (https://phabricator.wikimedia.org/T221226) [12:09:40] right [12:10:16] !log briefly stop all prometheus on prometheus1003 to finish metrics rsync - T187987 [12:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:22] T187987: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 [12:10:27] there will be unknowns in icinga [12:13:50] (03CR) 10Eevans: [C: 03+1] "> +Eric, since this is the 'services' Prometheus instance. I'm ok" [puppet] - 10https://gerrit.wikimedia.org/r/504529 (https://phabricator.wikimedia.org/T221055) (owner: 10Mathew.onipe) [12:15:40] (03PS1) 10Filippo Giunchedi: hieradata: Prometheus v2 for prometheus1004 [puppet] - 10https://gerrit.wikimedia.org/r/504540 (https://phabricator.wikimedia.org/T187987) [12:16:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thanks Eric! 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/504529 (https://phabricator.wikimedia.org/T221055) (owner: 10Mathew.onipe) [12:16:40] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 32.72 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:17:59] checking ^ [12:19:16] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 76.45 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:19:34] PROBLEM - puppet last run on prometheus1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:52] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:21:18] 10Operations, 10Continuous-Integration-Infrastructure: Upload Zuul 2.5.1-wmf7 package to apt.wikimedia.org - https://phabricator.wikimedia.org/T220380 (10hashar) [12:21:44] godog: I can see an equivalent drop on database traffic, so the drop is real [12:22:02] jynus: I think it was due to me restarting prometheus [12:22:07] oh [12:22:11] so real, but in metrics [12:22:18] gotcha [12:23:35] yeah [12:23:40] I can see, however, an increase in traffil afterwards [12:23:48] on a non-prometheus metrics source [12:23:52] *traffic [12:24:48] could be just a coincidence, the increase is not out of the ordinary [12:27:19] I can't find any corresponding drop in traffic on core routers, looks like it was indeed an artifact [12:27:30] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: Prometheus v2 for prometheus1004 [puppet] - 10https://gerrit.wikimedia.org/r/504540 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [12:28:12] (03CR) 10Muehlenhoff: Canary roles: create a canary role for the aqs canary server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504538 
(https://phabricator.wikimedia.org/T221226) (owner: 10Jbond) [12:30:20] (03CR) 10Muehlenhoff: Canary roles: create a canary role for the aqs canary server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504538 (https://phabricator.wikimedia.org/T221226) (owner: 10Jbond) [12:33:30] !log running some ferm tests on graphite2002 [12:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:16] (03PS3) 10Filippo Giunchedi: prometheus: set v2 max block duration to 24h [puppet] - 10https://gerrit.wikimedia.org/r/499742 (https://phabricator.wikimedia.org/T187987) [12:36:30] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: set v2 max block duration to 24h [puppet] - 10https://gerrit.wikimedia.org/r/499742 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [12:38:06] (03CR) 10Jbond: Canary roles: create a canary role for the aqs canary server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/504538 (https://phabricator.wikimedia.org/T221226) (owner: 10Jbond) [12:39:49] (03CR) 10Volans: [C: 04-1] "Small missing bit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504538 (https://phabricator.wikimedia.org/T221226) (owner: 10Jbond) [12:40:16] RECOVERY - puppet last run on cp1076 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:41:16] (03CR) 10Muehlenhoff: Canary roles: create a canary role for the aqs canary server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504538 (https://phabricator.wikimedia.org/T221226) (owner: 10Jbond) [12:41:28] (03PS2) 10Jbond: Canary roles: create a canary role for the aqs canary server [puppet] - 10https://gerrit.wikimedia.org/r/504538 (https://phabricator.wikimedia.org/T221226) [12:42:26] (03CR) 10Jbond: Canary roles: create a canary role for the aqs canary server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/504538 (https://phabricator.wikimedia.org/T221226) (owner: 10Jbond) [12:44:28] !log bounce prometheus 
instances on prometheus[12]003 after https://gerrit.wikimedia.org/r/c/operations/puppet/+/499742 [12:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:02] RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:47:33] (03Abandoned) 10Jbond: Revert "puppet: Refactor of the base::puppet class" [puppet] - 10https://gerrit.wikimedia.org/r/504364 (owner: 10Jbond) [12:47:42] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:51:38] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:57:50] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus1004.eqiad.wmnet [12:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:23] (03PS2) 10Filippo Giunchedi: prometheus: add maps cassandra job [puppet] - 10https://gerrit.wikimedia.org/r/504529 (https://phabricator.wikimedia.org/T221055) (owner: 10Mathew.onipe) [13:04:28] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add maps cassandra job [puppet] - 10https://gerrit.wikimedia.org/r/504529 (https://phabricator.wikimedia.org/T221055) (owner: 10Mathew.onipe) [13:12:26] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:14:12] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:27:35] 10Operations, 10Operations-Software-Development, 10Patch-For-Review, 10cloud-services-team (Kanban): Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169 (10hashar) [13:28:40] (03PS1) 10Ottomata: eventgate-analytics - bump resources and add comments about debug mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/504547 (https://phabricator.wikimedia.org/T220661) [13:33:26] (03PS1) 10Jcrespo: Fix typo on dbprov1002.mgmt entry [dns] - 10https://gerrit.wikimedia.org/r/504549 (https://phabricator.wikimedia.org/T219399) [13:34:34] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435 (10hashar) [13:35:09] (03CR) 10CDanis: [C: 03+1] Fix typo on dbprov1002.mgmt entry [dns] - 10https://gerrit.wikimedia.org/r/504549 (https://phabricator.wikimedia.org/T219399) (owner: 10Jcrespo) [13:35:28] (03CR) 10Jbond: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/504549 (https://phabricator.wikimedia.org/T219399) (owner: 10Jcrespo) [13:39:54] (03CR) 10Jcrespo: [C: 03+2] Fix typo on dbprov1002.mgmt entry [dns] - 10https://gerrit.wikimedia.org/r/504549 (https://phabricator.wikimedia.org/T219399) (owner: 10Jcrespo) [13:41:51] 10Operations, 10Traffic: Allow running several ATS instances on the same server - https://phabricator.wikimedia.org/T221217 (10ema) p:05Triage→03Normal [13:44:14] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:45:14] (03PS1) 10Mathew.onipe: cassandra: provide support for single instance [puppet] - 10https://gerrit.wikimedia.org/r/504551 (https://phabricator.wikimedia.org/T221055) [13:46:42] (03CR) 10Ottomata: [V: 03+2 C: 03+2] 
eventgate-analytics - bump resources and add comments about debug mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/504547 (https://phabricator.wikimedia.org/T220661) (owner: 10Ottomata) [13:47:51] !log reimage prometheus2004 - T187987 [13:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:07] T187987: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 [13:48:22] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [13:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:54] (03PS2) 10Mathew.onipe: cassandra: prometheus exporter for single instance [puppet] - 10https://gerrit.wikimedia.org/r/504551 (https://phabricator.wikimedia.org/T221055) [13:48:58] volans: ---^ magic [13:49:09] elukey: lol [13:49:16] (03PS1) 10Filippo Giunchedi: hieradata: run Prometheus v2 on prometheus2004 [puppet] - 10https://gerrit.wikimedia.org/r/504552 (https://phabricator.wikimedia.org/T187987) [13:50:23] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: run Prometheus v2 on prometheus2004 [puppet] - 10https://gerrit.wikimedia.org/r/504552 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [13:50:32] (03PS2) 10Filippo Giunchedi: hieradata: run Prometheus v2 on prometheus2004 [puppet] - 10https://gerrit.wikimedia.org/r/504552 (https://phabricator.wikimedia.org/T187987) [13:50:45] (03CR) 10Mathew.onipe: "PCC output is Ok: https://puppet-compiler.wmflabs.org/compiler1002/15854/" [puppet] - 10https://gerrit.wikimedia.org/r/504551 (https://phabricator.wikimedia.org/T221055) (owner: 10Mathew.onipe) [13:51:30] godog, elukey: ^ [13:52:38] (03PS1) 10Jbond: RAID: stop processing fact if device found via pci id [puppet] - 10https://gerrit.wikimedia.org/r/504554 
(https://phabricator.wikimedia.org/T220787) [13:52:44] !log upgrading hadoop cdh distribution to 5.16.1 on all the Hadoop-related nodes - T218343 [13:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:48] T218343: Upgrade analytics cluster to Cloudera CDH 5.16.1 - https://phabricator.wikimedia.org/T218343 [13:52:56] this needs a shutdown of the hadoop cluster --^ [13:56:09] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [13:56:12] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [13:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:12] !log otto@deploy1001 scap-helm eventgate-analytics finished [13:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:31] 10Operations, 10fundraising-tech-ops, 10netops: configure switch ports for frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T221232 (10Jgreen) [13:56:36] (03PS1) 10ArielGlenn: let xmlabstracts use multiple paths to find AbstractFilter [dumps] - 10https://gerrit.wikimedia.org/r/504557 [13:58:08] PROBLEM - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:58:10] (03CR) 10Filippo Giunchedi: [C: 03+2] cassandra: prometheus exporter for single instance [puppet] - 10https://gerrit.wikimedia.org/r/504551 (https://phabricator.wikimedia.org/T221055) (owner: 10Mathew.onipe) [13:58:25] (03PS3) 10Filippo Giunchedi: cassandra: prometheus exporter for single instance [puppet] - 10https://gerrit.wikimedia.org/r/504551 (https://phabricator.wikimedia.org/T221055) (owner: 10Mathew.onipe) [13:58:27] onimisionipe: LGTM [13:58:40] godog: Thanks! 
[14:02:22] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:03:34] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:06:39] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.66 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [14:09:09] (03CR) 10Volans: [C: 04-1] "This would break our support of md devices if I read it correctly as it's normal to have both a hardware raid and a software one in many o" [puppet] - 10https://gerrit.wikimedia.org/r/504554 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:09:47] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 14.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [14:12:03] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [14:12:22] this one wasn't downtimed, my bad [14:12:27] PROBLEM - Oozie Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie [14:12:43] it was expected [14:12:59] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [14:12:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:07] 10Operations, 10ops-codfw, 10DC-Ops: labtestneutron2002: refresh/rename to cloudnet2002-dev - https://phabricator.wikimedia.org/T214370 (10Papaul) 
05Open→03Resolved Complete ` papaul@asw-b-codfw# show | compare [edit interfaces ge-5/0/11] - description cloudnet2002-dev-eth2-disable; ` ` papaul@asw-b-c... [14:14:14] (03CR) 10Jbond: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/504554 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:14:44] (03CR) 10Jbond: "here is an example" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:16:40] (03PS1) 10Fsero: registryha: fixing ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/504560 (https://phabricator.wikimedia.org/T221099) [14:17:56] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestcontrol2003: rename to cloudcontrol2003-dev - https://phabricator.wikimedia.org/T220095 (10Papaul) ` papaul@asw-c-codfw# show | compare [edit interfaces ge-1/0/16] - description labtestcontrol2003; + description cloudcontrol2003-de... [14:18:01] (03PS2) 10Fsero: registryha: fixing ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/504560 (https://phabricator.wikimedia.org/T221099) [14:19:18] (03PS8) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) [14:19:20] (03PS6) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [14:19:22] (03PS1) 10Jcrespo: mariadb-backups: Set dbprov100[12] as spare for reimage [puppet] - 10https://gerrit.wikimedia.org/r/504562 (https://phabricator.wikimedia.org/T219399) [14:24:31] (03PS1) 10Papaul: DNS: Remove mgmt DNS for labtestcontrol2001 [dns] - 10https://gerrit.wikimedia.org/r/504564 [14:25:00] (03CR) 10Fsero: [C: 03+2] registryha: fixing ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/504560 (https://phabricator.wikimedia.org/T221099) (owner: 10Fsero) [14:25:24] (03CR) 
10Volans: [C: 04-1] "I'm aware of the other change and discussion and I agree we should report only one between hpssacli and ssacli. I see two options here:" [puppet] - 10https://gerrit.wikimedia.org/r/504554 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:34:42] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-codfw-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [14:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:45] !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed [14:34:45] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:21] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-eqiad-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [14:35:23] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [14:35:23] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:01] (03CR) 10ArielGlenn: [C: 03+2] let xmlabstracts use multiple paths to find AbstractFilter [dumps] - 10https://gerrit.wikimedia.org/r/504557 (owner: 10ArielGlenn) [14:36:12] (03PS3) 10Andrew Bogott: cloud: add new cloud-ns and cloud-recursor names to cloudservices hosts [dns] - 10https://gerrit.wikimedia.org/r/504490 (https://phabricator.wikimedia.org/T221183) [14:36:42] (03CR) 10Andrew Bogott: [C: 03+2] cloud: add new cloud-ns and cloud-recursor names to 
cloudservices hosts [dns] - 10https://gerrit.wikimedia.org/r/504490 (https://phabricator.wikimedia.org/T221183) (owner: 10Andrew Bogott) [14:36:58] !log ariel@deploy1001 Started deploy [dumps/dumps@dcf04a0]: fix up paths for 1.34_wmf.1 for AbstractFilter [14:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:02] !log ariel@deploy1001 Finished deploy [dumps/dumps@dcf04a0]: fix up paths for 1.34_wmf.1 for AbstractFilter (duration: 00m 04s) [14:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:24] (03PS2) 10Jbond: RAID: stop processing fact if device found via pci id [puppet] - 10https://gerrit.wikimedia.org/r/504554 (https://phabricator.wikimedia.org/T220787) [14:37:26] (03PS1) 10Jbond: RAID: remove unused fact cases [puppet] - 10https://gerrit.wikimedia.org/r/504567 [14:37:28] (03PS1) 10Volans: sre.hosts.downtime: update Phabricator task [cookbooks] - 10https://gerrit.wikimedia.org/r/504568 (https://phabricator.wikimedia.org/T221212) [14:38:11] (03CR) 10jerkins-bot: [V: 04-1] RAID: stop processing fact if device found via pci id [puppet] - 10https://gerrit.wikimedia.org/r/504554 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:38:17] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/504554 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:38:22] (03CR) 10jerkins-bot: [V: 04-1] RAID: remove unused fact cases [puppet] - 10https://gerrit.wikimedia.org/r/504567 (owner: 10Jbond) [14:39:18] (03PS2) 10Jbond: RAID: remove unused fact cases [puppet] - 10https://gerrit.wikimedia.org/r/504567 [14:40:07] (03PS3) 10Jbond: RAID: stop processing fact if device found via pci id [puppet] - 10https://gerrit.wikimedia.org/r/504554 (https://phabricator.wikimedia.org/T220787) [14:40:27] (03PS3) 10Jbond: RAID: remove unused fact cases [puppet] - 10https://gerrit.wikimedia.org/r/504567 [14:41:12] PROBLEM - Request latencies on argon is CRITICAL: 
instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:41:18] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:43:46] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:43:52] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:45:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, the respective RAID utils and support in the Icinga check have been removed two years ago in ed5077f6df8839273fc8032be6489323d" [puppet] - 10https://gerrit.wikimedia.org/r/504567 (owner: 10Jbond) [14:45:32] (03CR) 10CDanis: [C: 03+1] check_icinga: split configuration in two files (031 comment) [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/503943 (owner: 10Volans) [14:46:44] (03PS1) 10Mathew.onipe: Add maps postgres init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [14:47:35] (03CR) 10Volans: [C: 03+2] "thanks for the review!" (031 comment) [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/503943 (owner: 10Volans) [14:48:04] (03Merged) 10jenkins-bot: check_icinga: split configuration in two files [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/503943 (owner: 10Volans) [14:49:49] (03PS6) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [14:50:50] (03CR) 10CRusnov: "Thanks!" 
(031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [14:52:14] RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [14:54:51] (03PS1) 10Alex Monk: tlsproxy: Ensure OCSP stapling nginx reload hook present for acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/504571 (https://phabricator.wikimedia.org/T221171) [14:56:22] (03PS2) 10Alexandros Kosiaris: Add kubernetes[12]00[56] [puppet] - 10https://gerrit.wikimedia.org/r/504342 (https://phabricator.wikimedia.org/T220822) [14:56:45] (03PS1) 10Andrew Bogott: cloud dns: move primary services to cloud-ns0 and cloud-ns1 [puppet] - 10https://gerrit.wikimedia.org/r/504572 (https://phabricator.wikimedia.org/T221183) [14:59:00] RECOVERY - Oozie Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie [14:59:11] (03PS2) 10Alex Monk: tlsproxy: Ensure OCSP stapling nginx reload hook present for acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/504571 (https://phabricator.wikimedia.org/T221171) [15:03:37] (03CR) 10Hashar: "The rebase fails test since the module has since been converted to role/profile" [puppet] - 10https://gerrit.wikimedia.org/r/497253 (owner: 10Hashar) [15:03:54] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10jcrespo) Things should be ready for install, the dns issue delayed me a bit, and sadly it is a bit late to take care of the full installa... 
[15:04:28] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10jcrespo) [15:05:36] (03PS2) 10Andrew Bogott: cloud dns: move primary services to cloud-ns0 and cloud-ns1 [puppet] - 10https://gerrit.wikimedia.org/r/504572 (https://phabricator.wikimedia.org/T221183) [15:06:47] (03PS1) 10Bearloga: profile::discovery_dashboards: remove Wikipedia Portal dashboard [puppet] - 10https://gerrit.wikimedia.org/r/504577 (https://phabricator.wikimedia.org/T197138) [15:07:38] (03PS2) 10Giuseppe Lavagetto: confctl: Add filter_objects and update_objects [software/spicerack] - 10https://gerrit.wikimedia.org/r/503946 [15:07:40] (03PS2) 10Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 [15:07:42] (03PS1) 10Giuseppe Lavagetto: confctl: add change_and_revert contextmanager [software/spicerack] - 10https://gerrit.wikimedia.org/r/504578 [15:09:15] 10Operations, 10Wikimedia-Logstash, 10Security: Ferm: send ferm/iptables/ulogd logs to Kafaka/logstash/elasticsearch - https://phabricator.wikimedia.org/T220987 (10herron) I'm for erring on the side of simplicity. Since these logs are useful on the command line of an individual host, on centrallog, and in K... [15:13:21] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10RobH) [15:14:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Thanks! LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/504568 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [15:14:58] (03CR) 10jerkins-bot: [V: 04-1] confctl: Add filter_objects and update_objects [software/spicerack] - 10https://gerrit.wikimedia.org/r/503946 (owner: 10Giuseppe Lavagetto) [15:15:15] (03CR) 10jerkins-bot: [V: 04-1] Add the LBRemoteCluster class. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [15:15:18] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/15856/" [puppet] - 10https://gerrit.wikimedia.org/r/504571 (https://phabricator.wikimedia.org/T221171) (owner: 10Alex Monk) [15:15:22] RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:15:43] (03PS3) 10Andrew Bogott: cloud dns: move primary services to cloud-ns0 and cloud-ns1 [puppet] - 10https://gerrit.wikimedia.org/r/504572 (https://phabricator.wikimedia.org/T221183) [15:15:45] (03PS1) 10Andrew Bogott: cloud VMS: move primary resolvers to cloud-recursor0/1 [puppet] - 10https://gerrit.wikimedia.org/r/504580 (https://phabricator.wikimedia.org/T221183) [15:15:50] (03CR) 10jerkins-bot: [V: 04-1] confctl: add change_and_revert contextmanager [software/spicerack] - 10https://gerrit.wikimedia.org/r/504578 (owner: 10Giuseppe Lavagetto) [15:18:07] (03CR) 10Volans: [C: 03+1] "LGTM, if you could double check with a compiler on different Gen HP hosts and a random md/megaraid one it would be great." [puppet] - 10https://gerrit.wikimedia.org/r/504554 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [15:18:09] (03PS5) 10Hashar: shinken: add basic spec [puppet] - 10https://gerrit.wikimedia.org/r/497253 [15:18:26] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 183.7 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [15:18:40] (03CR) 10jerkins-bot: [V: 04-1] shinken: add basic spec [puppet] - 10https://gerrit.wikimedia.org/r/497253 (owner: 10Hashar) [15:18:54] 10Operations, 10ops-codfw, 10Reading-Infrastructure-Team-Backlog, 10decommission: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10RobH) a:05RobH→03Papaul Ok, this was neglected. 
This is now ready for @papaul to sercure erase the disks on all 4 maps-test systems. [15:20:10] (03CR) 10Hashar: "Which eventually fails to compile:" [puppet] - 10https://gerrit.wikimedia.org/r/497253 (owner: 10Hashar) [15:21:35] (03CR) 10Arturo Borrero Gonzalez: "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/504572 (https://phabricator.wikimedia.org/T221183) (owner: 10Andrew Bogott) [15:21:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloud dns: move primary services to cloud-ns0 and cloud-ns1 [puppet] - 10https://gerrit.wikimedia.org/r/504572 (https://phabricator.wikimedia.org/T221183) (owner: 10Andrew Bogott) [15:23:04] jouncebot: now [15:23:04] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [15:23:06] jouncebot: next [15:23:06] In 0 hour(s) and 36 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190417T1600) [15:26:14] (03PS1) 10Elukey: cumin: update hadoop alias [puppet] - 10https://gerrit.wikimedia.org/r/504585 (https://phabricator.wikimedia.org/T218343) [15:26:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission astatine - https://phabricator.wikimedia.org/T221244 (10RobH) [15:27:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473 (10RobH) [15:27:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission astatine - https://phabricator.wikimedia.org/T221244 (10RobH) [15:27:22] (03CR) 10CDanis: [C: 03+1] "How will we know when it fails? cron job sends email?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503945 (owner: 10Volans) [15:27:23] Reedy: I'm deploying SDC stuff. [15:27:50] Reedy: And merging the code alone is going to take ~45 minutes because of the Cirrus CI UBN I'm having to backport. :-( [15:28:02] Uhhhh [15:28:35] fun [15:28:43] Not so much. 
:-( [15:29:08] So far I've lost 20 minutes of merge time finding out which side of the branch cut the breakage landed (the wrong side). [15:30:44] (03CR) 10Jbond: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/504554 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [15:30:55] (03PS4) 10Jbond: RAID: stop processing fact if device found via pci id [puppet] - 10https://gerrit.wikimedia.org/r/504554 (https://phabricator.wikimedia.org/T220787) [15:31:37] (03PS3) 10Arturo Borrero Gonzalez: cumin: aliases: include codfw1dev openstack deployment [puppet] - 10https://gerrit.wikimedia.org/r/504309 [15:31:55] (03CR) 10Jbond: [C: 03+2] RAID: stop processing fact if device found via pci id [puppet] - 10https://gerrit.wikimedia.org/r/504554 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [15:32:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cumin: aliases: include codfw1dev openstack deployment [puppet] - 10https://gerrit.wikimedia.org/r/504309 (owner: 10Arturo Borrero Gonzalez) [15:32:45] (03PS4) 10Arturo Borrero Gonzalez: cumin: aliases: include codfw1dev openstack deployment [puppet] - 10https://gerrit.wikimedia.org/r/504309 [15:34:02] (03CR) 10Jbond: [C: 03+2] RAID: remove unused fact cases [puppet] - 10https://gerrit.wikimedia.org/r/504567 (owner: 10Jbond) [15:34:16] (03PS4) 10Jbond: RAID: remove unused fact cases [puppet] - 10https://gerrit.wikimedia.org/r/504567 [15:35:37] (03PS5) 10Jbond: RAID: remove unused fact cases [puppet] - 10https://gerrit.wikimedia.org/r/504567 [15:35:40] (03CR) 10Jbond: [V: 03+2 C: 03+2] RAID: remove unused fact cases [puppet] - 10https://gerrit.wikimedia.org/r/504567 (owner: 10Jbond) [15:36:45] 10Operations, 10TechCom, 10Core Platform Team Backlog (Attic), 10Services (attic), 10User-mobrovac: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825 (10jcrespo) I made similar questions at https://www.mediawiki.org/wiki/Topic:Uxpkxrhzklew3ets but they were out of scope 
there,... [15:38:34] (03PS1) 10Jbond: RAID: replace hpssacli with sscli [puppet] - 10https://gerrit.wikimedia.org/r/504586 (https://phabricator.wikimedia.org/T220787) [15:40:58] (03PS1) 10BBlack: Revert "wiktionary: test with zone-local CNAME->DYNA" [dns] - 10https://gerrit.wikimedia.org/r/504587 (https://phabricator.wikimedia.org/T208263) [15:41:01] (03PS1) 10BBlack: wikipedia.org: test with zone-local CNAME->DYNA [dns] - 10https://gerrit.wikimedia.org/r/504588 (https://phabricator.wikimedia.org/T208263) [15:42:09] 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops: decommission betelgeuse.frack.codfw.wmnet - https://phabricator.wikimedia.org/T206870 (10Jgreen) [15:45:18] (03PS9) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [15:45:39] * James_F sighs. The Cirrus patch hasn't even landed yet. [15:45:54] (03CR) 10jerkins-bot: [V: 04-1] raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [15:46:20] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "Per T209879#5107782, this should not be merged/deployed until Id7795b4f5c has been merged and deployed with the train, and we’re sure the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503342 (https://phabricator.wikimedia.org/T220609) (owner: 10Lucas Werkmeister (WMDE)) [15:46:49] (03CR) 10Jbond: "I found an issue with the dsa script and updated the CR appropriately. 
I have also created the following CR as an alternate way forward" [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [15:47:14] (03PS1) 10Herron: lvs: switch kibana scheduler to source hash [puppet] - 10https://gerrit.wikimedia.org/r/504590 (https://phabricator.wikimedia.org/T221143) [15:48:21] (03PS10) 10Jbond: raid: add ssacli class [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) [15:49:44] Reedy: OK, your patch will land next (probably in ~5 minutes); want to test and deploy, and then I'll do the rest of mine once they've landed, as I do SWAT? [15:50:09] Oh, it's just landed. Nice. [15:50:53] CC _joe_ :-) [15:50:57] 10Operations, 10Wikimedia-Logstash, 10Security: Ferm: send ferm/iptables/ulogd logs to Kafaka/logstash/elasticsearch - https://phabricator.wikimedia.org/T220987 (10jbond) sounds good [15:51:11] <_joe_> \o/ [15:51:30] I've put it live on mwdebug1002. [15:59:51] (03PS2) 10Elukey: cumin: update hadoop alias [puppet] - 10https://gerrit.wikimedia.org/r/504585 (https://phabricator.wikimedia.org/T218343) [16:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Morning SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190417T1600). [16:00:04] davidwbarratt: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:09] here! [16:00:21] I'll SWAT [16:00:26] And I also just added a patch to the window [16:00:55] (03CR) 10Elukey: [C: 03+2] cumin: update hadoop alias [puppet] - 10https://gerrit.wikimedia.org/r/504585 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [16:01:07] RoanKattouw: No, please. 
I've got six patches landing right now, and Reedy and _joe_ are testing. [16:01:24] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:26] OK I can wait. Works better for me actually [16:01:32] davidwbarratt: I'll sling your config patch out now if that's OK? [16:01:40] Oh and you said you'd do the SWAT too, go for it :) [16:01:41] James_F sure! [16:02:10] (03PS2) 10Jforrester: Deploy Partial blocks to Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504403 (https://phabricator.wikimedia.org/T220434) (owner: 10Dbarratt) [16:02:17] (03CR) 10Jforrester: [C: 03+2] Deploy Partial blocks to Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504403 (https://phabricator.wikimedia.org/T220434) (owner: 10Dbarratt) [16:02:27] RoanKattouw: :-) [16:03:06] <_joe_> James_F: the patch should be a noop, so minus some issues in the logs, it's ok [16:03:25] (03Merged) 10jenkins-bot: Deploy Partial blocks to Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504403 (https://phabricator.wikimedia.org/T220434) (owner: 10Dbarratt) [16:05:15] (03PS2) 10Vgutierrez: wikiba.se TLS: Make support for different certificate sources clearer [puppet] - 10https://gerrit.wikimedia.org/r/501461 (owner: 10Alex Monk) [16:05:16] davidwbarratt: Live on mwdebug1002. [16:05:30] kk, let me test [16:06:00] _joe_: Won't the change to Language.php change behaviour in the real-world? [16:06:09] James_F looks good to me! [16:06:27] <_joe_> James_F: not unless I add an overrides array [16:06:33] <_joe_> which we didn't do [16:06:41] Oh, OK. In that case, never mind. [16:06:46] <_joe_> that's why *this* was safe :P [16:07:00] <_joe_> the mwconfig patch I'm gonna merge tomorrow, OTOH, will not [16:07:14] <_joe_> but I'll just do it on testwiki [16:07:20] * James_F nods. 
[16:08:17] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT T220434 Deploy Partial blocks to Chinese Wikipedia (duration: 01m 02s) [16:08:41] davidwbarratt: Congratulations. [16:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:25] T220434: Deploy Partial blocks to Chinese Wikipedia - https://phabricator.wikimedia.org/T220434 [16:10:33] _joe_ thanks! [16:10:57] <_joe_> uh, you're welcome, but what did I do? [16:10:59] <_joe_> :P [16:11:52] !log set fasw-c-eqiad:ge-[0-1]/0/17 in admin vlan - T221232 [16:11:53] (03CR) 10jenkins-bot: Deploy Partial blocks to Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504403 (https://phabricator.wikimedia.org/T220434) (owner: 10Dbarratt) [16:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:56] T221232: configure switch ports for frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T221232 [16:12:06] (03PS1) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [16:12:43] (03CR) 10jerkins-bot: [V: 04-1] trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez) [16:12:49] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: rack and cable frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T213104 (10ayounsi) [16:12:53] 10Operations, 10fundraising-tech-ops, 10netops: configure switch ports for frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T221232 (10ayounsi) 05Open→03Resolved a:03ayounsi [16:13:01] <_joe_> James_F: did you deploy my change already to the whole cluster? 
[16:13:06] !log jforrester@deploy1001 scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [16:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:28] <_joe_> or is it just on mwdebug1002? [16:13:32] Eurgh, no. That scap went poorly. [16:13:41] _joe_: I put the whole think on mwdebug1002. [16:13:57] _joe_: I was just putting the important bits on all-prod, but that broke. [16:14:09] <_joe_> why did it broke? [16:14:25] <_joe_> let's see the logs from the canaries [16:14:27] Looking now. [16:14:34] <_joe_> [{exception_id}] {exception_url} ErrorException from line 2740 of /srv/mediawiki/php-1.34.0-wmf.1/languages/Language.php: PHP Warning: Invalid operand type was used: array_key_exists expects an array or an object; false returned. [16:14:39] <_joe_> oh crap [16:14:44] <_joe_> well I can fix that [16:15:00] Ah, yes. [16:15:00] <_joe_> geez I did it right and someone made me "fix" it :P [16:15:10] * James_F tsks. :-) [16:15:13] <_joe_> ok so, let's revert in .1 [16:15:13] (03PS2) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [16:15:19] Okie-dokie. 
[16:15:36] <_joe_> and we can backport both patches [16:15:41] (03CR) 10jerkins-bot: [V: 04-1] trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez) [16:16:45] <_joe_> but tbh [16:16:52] <_joe_> I'm not sure why that exception arises [16:16:55] (03PS2) 10Dzahn: DNS: Remove mgmt DNS for labtestcontrol2001 [dns] - 10https://gerrit.wikimedia.org/r/504564 (owner: 10Papaul) [16:17:07] (03PS3) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [16:17:17] Oh, I probably should have synced DefaultSettings first. [16:17:20] Let's try that. [16:17:25] <_joe_> James_F: yes you need to [16:17:28] <_joe_> that's what happened [16:17:33] <_joe_> I was looking back at my patch [16:17:39] <_joe_> and it was correct :) [16:17:56] <_joe_> yes, DefaultSettings always goes first [16:18:06] _joe_: This is not the kind of patch I would generally recommend back-porting to prod. [16:18:47] "Always" except most of the time we back-port, when the change of behaviour needs the code to be updated first. 
:-) [16:18:47] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.1/includes/DefaultSettings.php: T219279 Ability to set wgOverrideUcfirstCharacters part 1b (duration: 01m 03s) [16:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:52] T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 [16:18:56] <_joe_> James_F: eheh indeed [16:20:42] (03PS4) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [16:20:58] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.1/languages/Language.php: T219279 Ability to set wgOverrideUcfirstCharacters part 1 try two (duration: 01m 00s) [16:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:24] 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops: decommission betelgeuse.frack.codfw.wmnet - https://phabricator.wikimedia.org/T206870 (10Papaul) a:03Papaul [16:22:31] (03PS5) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [16:22:36] <_joe_> James_F: ok I don't see errors anymore [16:22:36] (03CR) 10Dzahn: [C: 03+2] DNS: Remove mgmt DNS for labtestcontrol2001 [dns] - 10https://gerrit.wikimedia.org/r/504564 (owner: 10Papaul) [16:22:46] _joe_: Yeah, should all be done and quiet. [16:23:00] <_joe_> thanks <3 [16:23:05] Any time. [16:23:23] (03PS7) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [16:23:33] <_joe_> James_F: uhm you didn't deploy what's under maintenance/language, right? 
I can do it later [16:23:57] (03CR) 10jerkins-bot: [V: 04-1] coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [16:26:08] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:27:02] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:27:36] looking at that [16:27:50] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:28:35] Telia transport [16:29:31] (03PS12) 10Jforrester: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) [16:29:33] (03PS1) 10Jforrester: SDC: Enable Wikidata federation on Commons again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504608 (https://phabricator.wikimedia.org/T214075) [16:29:53] _joe_: I didn't, no. I can do that if you want. [16:30:16] <_joe_> thanks :) [16:33:28] RoanKattouw: Still waiting for code to land, sorry. :-( [16:35:08] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10Dzahn) [16:36:20] Krinkle: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/NavigationTiming/+/504525 seems to be merged but not deployed on wmf.1? 
[16:38:22] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.1/maintenance/language/generateUcfirstOverrides.php: Maintenance script for _joe_ (duration: 01m 00s) [16:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:01] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10Dzahn) time frame 16:27 UTC, 12:27 PST: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 12:27 <+icinga-wm> PROBLEM - Router interfaces on cr2-eqor... [16:39:31] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.1/maintenance/language/generateUpperCharTable.php: Maintenance script for _joe_ (duration: 00m 59s) [16:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:39] <_joe_> thanks :) [16:39:58] I guess you'll need the autoloader too. [16:40:16] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10ayounsi) Indeed, just got a notification: "We have an outage which is suspected to be caused by a cable fault. Our NOC is investigating and activating local resources. We will provide more informati... [16:41:09] _joe_: OK, everything except the tests/ directory now scap'ed. 
[16:41:18] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.1/autoload.php: Update to point to new maintenance scripts (duration: 01m 00s) [16:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:03] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn https://phabricator.wikimedia.org/T221259 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:46:13] ACKNOWLEDGEMENT - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn https://phabricator.wikimedia.org/T221259 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:47:40] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/WikibaseMediaInfo/: SDC: Various fixes T218922 T221071 T221110 T221123 (duration: 01m 02s) [16:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:50] T221123: Adding multiple captions at once is v broken (Beta Commons) - https://phabricator.wikimedia.org/T221123 [16:47:50] T218922: SDC: The tabs end up in the wrong place in "view" non-view pages like diffs - https://phabricator.wikimedia.org/T218922 [16:47:51] T221110: Cannot edit statements for certain images - https://phabricator.wikimedia.org/T221110 [16:47:51] T221071: We don't want MediaInfo content to show up on wikis sharing Commons files - https://phabricator.wikimedia.org/T221071 [16:49:04] !log deleting three files for legal compliance [16:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:33] (03PS1) 10Thcipriani: gerrit: lower sshd threadpool size [puppet] - 10https://gerrit.wikimedia.org/r/504611 (https://phabricator.wikimedia.org/T221026) [16:51:13] 10Operations: Mapping of servers to stakeholders - 
https://phabricator.wikimedia.org/T216088 (10herron) To give a few real-world examples ` bast1002:~$ for i in `seq 501 998`; do getent group $i; done | cut -d : -f 1 all-users ops ` ` elastic1017:~$ for i in `seq 501 998`; do getent group $i; done | cut -d : -... [16:51:31] (03CR) 10Paladox: [C: 03+1] gerrit: lower sshd threadpool size [puppet] - 10https://gerrit.wikimedia.org/r/504611 (https://phabricator.wikimedia.org/T221026) (owner: 10Thcipriani) [16:53:26] (03CR) 10Filippo Giunchedi: "+Luca for aqs" [puppet] - 10https://gerrit.wikimedia.org/r/504538 (https://phabricator.wikimedia.org/T221226) (owner: 10Jbond) [16:55:54] 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops: decommission betelgeuse.frack.codfw.wmnet - https://phabricator.wikimedia.org/T206870 (10Papaul) ` papaul@fasw-c-codfw# run show interfaces ge-[0-1]/0/8 descriptions Interface Admin Link Description ge-0/0/8 up up betelge... [16:56:32] PROBLEM - High lag on wdqs1010 is CRITICAL: 2.478e+06 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:57:15] 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops: decommission betelgeuse.frack.codfw.wmnet - https://phabricator.wikimedia.org/T206870 (10Papaul) [17:00:36] 10Operations, 10Growth-Team, 10Notifications: Discussion: Explore push notifications options - https://phabricator.wikimedia.org/T221265 (10jbond) p:05Triage→03Normal [17:02:10] (03PS1) 10Dzahn: remove prod entries for betelgeuse.frack [dns] - 10https://gerrit.wikimedia.org/r/504615 (https://phabricator.wikimedia.org/T206870) [17:02:59] (03CR) 10Bearloga: [C: 04-1] "TODO: need to double check with Deb about this" [puppet] - 10https://gerrit.wikimedia.org/r/504577 (https://phabricator.wikimedia.org/T197138) (owner: 10Bearloga) [17:03:26] RoanKattouw: Around to test? Sorry. 
[17:03:34] Yeah [17:03:38] (03CR) 10Dzahn: [C: 03+2] remove prod entries for betelgeuse.frack [dns] - 10https://gerrit.wikimedia.org/r/504615 (https://phabricator.wikimedia.org/T206870) (owner: 10Dzahn) [17:03:50] (03CR) 10Dzahn: [C: 03+2] "switch port deactivated by papaul" [dns] - 10https://gerrit.wikimedia.org/r/504615 (https://phabricator.wikimedia.org/T206870) (owner: 10Dzahn) [17:04:01] RoanKattouw: Live on mwdebug1002. [17:04:22] (But will the job you trigger execute on debug?) [17:04:57] 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops, 10Patch-For-Review: decommission betelgeuse.frack.codfw.wmnet - https://phabricator.wikimedia.org/T206870 (10Dzahn) [17:06:41] Maybe? We'll see [17:06:46] I just sent Stephane a message [17:07:09] (03CR) 10Dzahn: [C: 03+2] gerrit: lower sshd threadpool size [puppet] - 10https://gerrit.wikimedia.org/r/504611 (https://phabricator.wikimedia.org/T221026) (owner: 10Thcipriani) [17:07:21] You were right [17:07:29] It's just a revert, so let's sync it and then we'll test again [17:07:40] Should be pretty safe since we're just reverting to code that's been used for ~3 years [17:07:44] ( James_F ) [17:08:11] RoanKattouw: Yeah, syncing now. [17:08:31] (03PS1) 10Ottomata: Enable api-request EventGate logging for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504618 [17:08:58] Then I've got the 2FA un-breakage, and then - oh boy! - I've got a config patch for federation on Commons, which last time broke the world. Three hours of deployments isn't fun. [17:09:01] thanks mutante ! 
[17:09:05] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/Echo/includes/formatters/: Notifications: Revert 7121b9c4 per I8f9a6a19ba (duration: 01m 01s) [17:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:17] (03PS2) 10Ottomata: Enable api-request EventGate logging for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504618 (https://phabricator.wikimedia.org/T214080) [17:10:16] thcipriani: puppet just ran on cobalt [17:10:19] and yw [17:10:57] (03CR) 10Hashar: [C: 03+1] "Non-Interactive Users: https://gerrit.wikimedia.org/r/#/admin/groups/4,members" [puppet] - 10https://gerrit.wikimedia.org/r/504611 (https://phabricator.wikimedia.org/T221026) (owner: 10Thcipriani) [17:11:23] btw, i hope we can reinstall cobalt and make it gerrit1001 soonish [17:11:31] enough with the misc names :) [17:11:47] 10Operations, 10Growth-Team, 10Notifications: Discussion: Explore push notifications options - https://phabricator.wikimedia.org/T221265 (10jbond) [17:12:25] mutante: Boo, you have no soul. ;-) [17:12:54] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [17:12:58] James_F: lol, i just got used to it :) [17:13:18] it's more about the OS version :) [17:13:30] (03PS2) 10Andrew Bogott: cloud VMS: move primary resolvers to cloud-recursor0/1 [puppet] - 10https://gerrit.wikimedia.org/r/504580 (https://phabricator.wikimedia.org/T221183) [17:13:53] Yeah, fair. Buster here we come! Oh, wait. ;-) [17:14:07] lol [17:14:21] (03CR) 10Ppchelko: [C: 03+1] Enable api-request EventGate logging for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504618 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [17:14:25] eyeroll [17:14:42] Pchelolo: ready? 
[17:14:53] busted [17:15:08] ottomata: ready [17:15:18] James_F: Thanks, notif icons working again now [17:15:49] (03CR) 10Ottomata: [C: 03+2] Enable api-request EventGate logging for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504618 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [17:15:54] (03PS3) 10Ottomata: Enable api-request EventGate logging for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504618 (https://phabricator.wikimedia.org/T214080) [17:15:57] RoanKattouw: Cool. [17:18:30] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/OATHAuth/: UBN T221257 train un-blocker (duration: 01m 02s) [17:18:31] !log LDAP - added 'brennen' to group 'gerritadmin' (T218858) [17:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:35] T221257: 2FA broken on mediawiki.org - https://phabricator.wikimedia.org/T221257 [17:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:39] T218858: Bless Brennen with Gerrit administrator rights - https://phabricator.wikimedia.org/T218858 [17:18:40] OK, SWAT's now finally over, I'm releasing the conch. [17:18:42] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 106.4 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [17:19:29] Pchelolo, ottomata: Go ahead. 
[17:19:31] scap syning now Pchelolo [17:19:32] (03CR) 10jenkins-bot: Enable api-request EventGate logging for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504618 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [17:19:39] thanks James_F [17:19:41] brennen: welcome to gerrit admins [17:20:05] thank you, mutante [17:20:24] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enabling EventGate api-request logging on group1 wikis (duration: 01m 00s) [17:20:30] (although if recent experience is any guide, i may come to regret the impulse. :) ) [17:20:49] (03CR) 10Elukey: [C: 03+1] "LGTM if the inclusion of a role into another is fine! (I mean for our puppet guidelines)" [puppet] - 10https://gerrit.wikimedia.org/r/504538 (https://phabricator.wikimedia.org/T221226) (owner: 10Jbond) [17:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:46] ACKNOWLEDGEMENT - High lag on wdqs1010 is CRITICAL: 2.475e+06 ge 3600 daniel_zahn https://phabricator.wikimedia.org/T209201 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:24:44] ACKNOWLEDGEMENT - toolschecker: gridengine webservice running on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/gridengine - 351 bytes in 0.479 second response time daniel_zahn scheduled downtime https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [17:25:31] I got paged for the toolschecker ack and there was no critical first? [17:26:07] apergos: for some reason it was both in scheduled downtime but also status "unhandled" . i dont know why, that's normally not the case [17:26:15] normally downtime means handled [17:26:21] anyways, this explains what you saw [17:26:26] huh [17:26:33] weird [17:27:23] hmm... 
could have been some strange state from the downtime we put on those toolschecker checks a couple of weeks ago I guess. [17:27:54] so far so good Pchelolo ! [17:27:59] yup yup [17:30:06] PROBLEM - Host labvirt1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:30:07] @seen fsalutari [17:30:07] mutante: I have never seen fsalutari [17:30:32] another random eqiad mgmt host going down [17:31:21] cmjohnson1: labvirt1016.mgmt went down - this time i actually can't connect to it though, as opposed to yesterday's event [17:33:41] 10Operations, 10Traffic: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (10Krenair) [17:33:49] 10Operations, 10Traffic: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (10Krenair) p:05Triage→03Low [17:35:01] 10Operations, 10Traffic: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (10Krenair) ` modules/archiva/manifests/proxy.pp: # regsubst is needed due to letsencrypt::cert::integrated's naming modules/profile/manifests/gerrit/server.pp: letsencrypt::cert::integra... 
[17:35:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Cmjohnson) [17:36:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Cmjohnson) @andrewbogott: all the on-site work is finished, if you can do raid cfg and dhcp file then it should work [17:37:14] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:37:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1002 with 10G interfaces - https://phabricator.wikimedia.org/T221140 (10Cmjohnson) [17:37:21] 10Operations, 10Traffic: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (10Paladox) I use " modules/profile/manifests/gerrit/server.pp: letsencrypt::cert::integrated { 'gerrit':" for gerrit.git.wmflabs.org and gerrit.gerrit.wmflabs.org as the acme service does not work in WMCS... [17:37:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1002 with 10G interfaces - https://phabricator.wikimedia.org/T221140 (10Cmjohnson) @andrewbogott: all the on-site work is finished, if you can do raid cfg and dhcp file then it should work I will remove old switch data on... [17:38:03] 10Operations, 10Traffic: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (10Krenair) It does work in WMCS, with some puppet cherry-picks and some credentials generated by WMCS admins to allow modification of designate DNS records from within instances. [17:39:38] RECOVERY - Long running screen/tmux on prometheus2004 is OK: OK: SCREEN detected but not long running. 
[17:43:50] ottomata, Pchelolo: OK for me to deploy a config change? [17:43:59] yes, James_F we are done [17:44:01] thanks! [17:44:11] Cool. [17:44:34] (03CR) 10Volans: [C: 03+2] sre.hosts.downtime: update Phabricator task [cookbooks] - 10https://gerrit.wikimedia.org/r/504568 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [17:44:39] (03CR) 10Jforrester: [C: 03+2] SDC: Enable Wikidata federation on Commons again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504608 (https://phabricator.wikimedia.org/T214075) (owner: 10Jforrester) [17:46:09] (03PS1) 10Andrew Bogott: Fix order of cloud* entries [dns] - 10https://gerrit.wikimedia.org/r/504623 [17:46:16] (03Merged) 10jenkins-bot: sre.hosts.downtime: update Phabricator task [cookbooks] - 10https://gerrit.wikimedia.org/r/504568 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [17:47:22] (03CR) 10Volans: "> Patch Set 2: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503945 (owner: 10Volans) [17:47:34] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:48:05] (03PS2) 10Andrew Bogott: Fix order of cloud* entries [dns] - 10https://gerrit.wikimedia.org/r/504623 [17:48:46] (03PS2) 10Jforrester: SDC: Enable Wikidata federation on Commons again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504608 (https://phabricator.wikimedia.org/T214075) [17:48:52] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504608 (https://phabricator.wikimedia.org/T214075) (owner: 10Jforrester) [17:49:05] (03CR) 10Andrew Bogott: [C: 03+2] Fix order of cloud* entries [dns] - 10https://gerrit.wikimedia.org/r/504623 (owner: 10Andrew Bogott) [17:49:59] (03Merged) 10jenkins-bot: SDC: Enable Wikidata federation on Commons again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504608 
(https://phabricator.wikimedia.org/T214075) (owner: 10Jforrester) [17:50:14] (03CR) 10jenkins-bot: SDC: Enable Wikidata federation on Commons again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504608 (https://phabricator.wikimedia.org/T214075) (owner: 10Jforrester) [17:52:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1003 with 10G interfaces - https://phabricator.wikimedia.org/T221139 (10Cmjohnson) [17:53:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1003 with 10G interfaces - https://phabricator.wikimedia.org/T221139 (10Cmjohnson) @andrewbogott same as cloudvirt1001/1002. [17:57:41] (03PS1) 10Jgreen: add frav1002.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/504627 [17:58:26] jouncebot: now [17:58:26] No deployments scheduled for the next 0 hour(s) and 1 minute(s) [17:58:36] thcipriani: Just syncing. [17:58:37] a whole minute [17:58:51] Then it's the train-wait for an hour. [17:59:04] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SDC: Enable Wikidata federation on Commons again T214075 (duration: 01m 00s) [17:59:04] (03PS2) 10Jgreen: add frav1002.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/504627 (https://phabricator.wikimedia.org/T143900) [17:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:09] T214075: Enable federated access to entities and properties from Wikidata to Commons - https://phabricator.wikimedia.org/T214075 [17:59:11] OK, I'm done. Hopefully. [17:59:19] cool [17:59:24] I want to do a gerrit restart so that it picks up some new config values [17:59:32] Cool. Go for it. 
[17:59:39] (03CR) 10Jgreen: [C: 03+2] add frav1002.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/504627 (https://phabricator.wikimedia.org/T143900) (owner: 10Jgreen) [17:59:42] * thcipriani does [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190417T1800) [18:01:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1004 with 10G interfaces - https://phabricator.wikimedia.org/T221138 (10Cmjohnson) MAC F0:92:1C:05:F5:70 [18:01:27] !log gerrit restart for https://gerrit.wikimedia.org/r/504611/ [18:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:20] !log gerrit back [18:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:08] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [18:06:12] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [18:07:06] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [18:07:28] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [18:08:08] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer] [18:08:32] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [18:08:47] hrm [18:08:52] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [18:10:46] chaomodus that's due to the gerrit restart. [18:10:53] Should recover :) [18:11:01] oic :) [18:14:17] (03PS5) 10Alex Monk: acme_chief: Add security::access::config on passive host in cloud [puppet] - 10https://gerrit.wikimedia.org/r/497430 [18:15:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1004 with 10G interfaces - https://phabricator.wikimedia.org/T221138 (10Cmjohnson) [18:15:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1004 with 10G interfaces - https://phabricator.wikimedia.org/T221138 (10Cmjohnson) This is ready for you @andrewbogott [18:23:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (10Cmjohnson) MAC F0:92:1C:05:F5:20 [18:31:47] (03PS1) 10Ottomata: Add linux-host-entries and netboot/partman for schemaa[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/504646 (https://phabricator.wikimedia.org/T219556) [18:32:04] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventBus, and 5 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (10Ottomata) [18:32:32] (03PS2) 10Ottomata: Add linux-host-entries and netboot/partman for schema[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/504646 
(https://phabricator.wikimedia.org/T219556) [18:32:40] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:33:17] (03PS3) 10Ottomata: Add linux-host-entries and netboot/partman for schema[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/504646 (https://phabricator.wikimedia.org/T219556) [18:33:32] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:33:54] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:34:31] (03CR) 10Ottomata: [C: 03+2] Add linux-host-entries and netboot/partman for schema[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/504646 (https://phabricator.wikimedia.org/T219556) (owner: 10Ottomata) [18:34:38] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:34:58] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:35:10] (03PS3) 10Andrew Bogott: cloud VMS: move primary resolvers to cloud-recursor0/1 [puppet] - 10https://gerrit.wikimedia.org/r/504580 (https://phabricator.wikimedia.org/T221183) [18:35:12] (03PS4) 10Andrew Bogott: cloud dns: move primary services to cloud-ns0 and cloud-ns1 [puppet] - 10https://gerrit.wikimedia.org/r/504572 (https://phabricator.wikimedia.org/T221183) [18:35:14] (03PS1) 10Andrew Bogott: cloudvirts: update mac addresses for hosts moved to 10G [puppet] - 10https://gerrit.wikimedia.org/r/504647 [18:35:18] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:35:34] (03PS2) 10Andrew Bogott: cloudvirts: update mac addresses for hosts moved to 10G [puppet] - 10https://gerrit.wikimedia.org/r/504647 [18:36:16] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirts: update mac addresses for hosts moved to 10G 
[puppet] - 10https://gerrit.wikimedia.org/r/504647 (owner: 10Andrew Bogott) [18:37:54] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:39:16] andrewbogott: i ran puppet on install servers and saw your cloudvirt maac change [18:39:32] ottomata: great, that's why I'm not seeing it :) [18:40:13] (03PS1) 10Alex Monk: archiva::proxy: remove old letsencrypt module stuff [puppet] - 10https://gerrit.wikimedia.org/r/504648 (https://phabricator.wikimedia.org/T221268) [18:47:25] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10colewhite) p:05Triage→03High [18:47:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10colewhite) p:05Triage→03Normal [18:48:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1002 with 10G interfaces - https://phabricator.wikimedia.org/T221140 (10colewhite) p:05Triage→03Normal [18:48:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1003 with 10G interfaces - https://phabricator.wikimedia.org/T221139 (10colewhite) p:05Triage→03Normal [18:48:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1004 with 10G interfaces - https://phabricator.wikimedia.org/T221138 (10colewhite) p:05Triage→03Normal [18:51:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (10Cmjohnson) [18:51:57] (03CR) 10Paladox: "This should be updated to match the latest change at https://gerrit-review.googlesource.com/c/gerrit/+/220263 (same for the 2.16 change)" [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/502487 (owner: 10DCausse) [18:56:03] I would like some help. 
I have a query that looks for people, but sometimes it also includes other pages as well. I have missed something but don't know what. [18:56:15] Here is my query: https://en.wikipedia.org/w/api.php?format=json&action=query&formatversion=2&generator=prefixsearch&gpssearch=putin&gpslimit=10&prop=pageimages|pageterms|categories&piprop=thumbnail&pithumbsize=96&pilimit=10&wbptterms=description&gpsprofile=strict&cllimit=max&clcategories=Category:Living%20people [19:00:04] twentyafterfour: Dear deployers, time to do the MediaWiki train - Americas version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190417T1900). [19:00:24] marlon_is_home: you could try this full text search query: incategory:"Living People" putin [19:00:40] 10Operations, 10Traffic, 10Patch-For-Review: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (10Krenair) [19:02:24] ebernhardson: What endpoint should I be calling and what parameters? [19:02:46] marlon_is_home: like this: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=search&srsearch=putin%20incategory%3A%22Living%20People%22 [19:03:36] marlon_is_home: if you don't want results like "Lyudmila Putina" you can also put putin in quotes and it wont do the stemming that makes putin and putina the same word [19:03:43] (03PS1) 10Alex Monk: maintain-replicas: Add ipb_sitewide field to ipblocks and ipblocks_ipindex [puppet] - 10https://gerrit.wikimedia.org/r/504653 (https://phabricator.wikimedia.org/T221272) [19:04:08] marlon_is_home: hmm, actually i guess putin is mentioned in the article so it would still match [19:05:06] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/504590 (https://phabricator.wikimedia.org/T221143) (owner: 10Herron) [19:05:11] ebernhardson: Thank you so much, I have manage to do this myself as well. But the tricky part was to include the image of the person as well. 
That's the only thing I couldn't get to work. [19:05:26] (03CR) 10Dbarratt: [C: 03+1] "Looks good to me, although, I have no idea what I'm looking at. :)" [puppet] - 10https://gerrit.wikimedia.org/r/504653 (https://phabricator.wikimedia.org/T221272) (owner: 10Alex Monk) [19:06:05] (03CR) 10Cwhite: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/504409 (owner: 10Jbond) [19:07:03] And as you said, I only want people whos name is matched, and not including search in snippet :) [19:07:33] marlon_is_home: what do you mean by not including search in snippet? Search is performed against the full text content, do you only want pages with titles/redirects containing putin? [19:08:51] Correct, sorry for being unclear. I want pages with titles/redirects containing putin. [19:09:22] (03PS1) 1020after4: group1 wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504655 [19:09:26] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504655 (owner: 1020after4) [19:09:29] marlon_is_home: in that case you want something like this: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=pageimages&generator=search&piprop=thumbnail&pilimit=50&gsrsearch=intitle%3Aputin%20incategory%3A%22Living%20people%22&gsrlimit=50 [19:10:14] marlon_is_home: the intitle keyword limits to titles and redirects [19:10:29] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504655 (owner: 1020after4) [19:10:32] (03PS13) 10Jforrester: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) [19:10:42] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504655 (owner: 1020after4) [19:12:29] (03CR) 10Cwhite: [C: 03+1] "It looks 
like most if not all of the feedback has been handled. Volans might have more but the latest patchset looks good to me." [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 (owner: 10Jbond) [19:12:50] There is a strange behaviour in the search. I can't figure out why irrelevant matches appear like "Randy Newman". He's only connection was that he made a song named "Putin" [19:13:03] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.1 refs T220726 [19:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:18] T220726: 1.34.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T220726 [19:14:10] 10Operations, 10Core Platform Team Backlog, 10MediaWiki-General-or-Unknown, 10serviceops, and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10kchapman) >>! In T219279#5096304, @Joe wrote: >>>! In T219279#5095261... [19:14:25] ebernhardson: Maybe there isn't away around it. [19:14:53] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.1 refs T220726 (duration: 01m 49s) [19:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:16] marlon_is_home: if you look at https://en.wikipedia.org/wiki/Randy_Newman?action=cirrusdump that dumps the internal representation search uses [19:15:51] marlon_is_home: we can see in the redirect: field there is a redirect called 'Putin (song)' and 'Putin (Randy Newman song)'. 
So editors have decided putin is a good title for that page [19:16:26] 10Operations, 10Core Platform Team Kanban, 10MediaWiki-General-or-Unknown, 10serviceops, and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10kchapman) [19:16:53] marlon_is_home: it sounds like you might be looking for a more structured information query, perhaps wikidata and the wikidata query service would be more appropriate? [19:17:03] 10Operations, 10MediaWiki-General-or-Unknown, 10serviceops, 10Core Platform Team (PHP7 (TEC4)), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10kchapman) [19:17:05] marlon_is_home: it's more indirect, and harder to use though [19:18:06] ebernhardson: I understand, thank you so much for your help. I'll try to find other solutions through wikidata. Have a nice evening. [19:21:37] (03PS1) 10Dzahn: icinga/base/prometheus: add notes_url to DISK space checks [puppet] - 10https://gerrit.wikimedia.org/r/504658 (https://phabricator.wikimedia.org/T220326) [19:21:44] marlon_is_home: perhaps this kind of query would work: https://www.wikidata.org/w/index.php?search=inlabel%3Aputin%40en+haswbstatement%3AP31%3DQ5&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1&ns120=1 [19:22:30] ebernhardson, marlon_is_home: Might be better to take this to #wikimedia-tech if it's not directly relevant to production deployments. 
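[annotation] The search query ebernhardson suggests above can be sketched in Python as follows. This is only an illustration, not something posted in the channel: the helper names `build_title_search_params` and `build_title_search_url` are hypothetical, and the parameter values are taken from the URLs quoted in the conversation (`generator=search`, `gsrsearch` with the `intitle:` and `incategory:` keywords, `prop=pageimages`, `piprop=thumbnail`, plus `formatversion=2` and `pithumbsize=96` from marlon_is_home's original query). The sketch only builds the request URL and does not contact the API.

```python
from urllib.parse import urlencode

# English Wikipedia action API endpoint, as used in the URLs quoted in the log.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_title_search_params(term, category, limit=50, thumb_size=96):
    """Return query parameters for a title/redirect search limited to one category.

    Hypothetical helper: 'intitle:' restricts matches to titles and redirects,
    'incategory:' restricts them to pages in the given category.
    """
    search = f'intitle:{term} incategory:"{category}"'
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "generator": "search",         # feed search results into prop=pageimages
        "gsrsearch": search,
        "gsrlimit": str(limit),
        "prop": "pageimages",          # attach a thumbnail to each result
        "piprop": "thumbnail",
        "pithumbsize": str(thumb_size),
        "pilimit": str(limit),
    }


def build_title_search_url(term, category, **kwargs):
    """Return the full request URL for the parameters above."""
    params = build_title_search_params(term, category, **kwargs)
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_title_search_url("putin", "Living people"))
```

As noted in the conversation, a query like this can still return pages such as "Randy Newman", because `intitle:` also matches redirects (for example 'Putin (song)'); filtering by structured data, as in the Wikidata suggestion, would be a separate approach.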
[19:22:45] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:23:40] uhm [19:23:58] well that doesn't look good [19:24:02] ^ [19:24:55] there is usually a spike after deploying the train recently but this one doesn't seem to be subsiding [19:26:03] I guess we need to roll back [19:26:06] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: rack and cable frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T213104 (10Jgreen) [19:26:16] they are all 60 second request timeout errors [19:26:25] no particular pattern to point to a culprit [19:26:26] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: rack and cable frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T213104 (10Jgreen) 05Open→03Resolved Runs good! [19:26:46] twentyafterfour: well, the graph does seem to be going down though now [19:27:02] and as you said we saw that kind of spike during deploy. right [19:27:58] yeah maybe not rollback worthy [19:28:20] these timeouts sure do make it difficult for me to determine when to roll back [19:28:53] it seems to get better in the last 8 min [19:28:57] yeah [19:29:18] I guess it's just hhvm opcache getting dumped, same old problem [19:30:22] * twentyafterfour is hoping php7 solves all of my problems for me [19:30:51] Ha. [19:31:08] If only. [19:31:17] It'll just replace them with different ones. [19:31:29] brand new problems, woohoo! 
:P [19:31:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:31:53] heh yeah [19:32:25] even though the error rate is mostly recovered now, I still see a ton of this specific timeout: Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.34.0-wmf.1/extensions/Scribunto/includes/engines/LuaSandbox/Engine.php on line 282 [19:33:18] Thank you James_F [19:33:54] 10Operations, 10monitoring, 10Patch-For-Review: prometheus1004 /srv/prometheus/ops almost full - https://phabricator.wikimedia.org/T220326 (10Dzahn) @Marostegui @fgiunchedi I made this new "landing page" for disk space checks and from there i linked to the Prometheus runbook: https://wikitech.wikimedia.or... [19:34:08] twentyafterfour: Hmm. Could be an unfortunately-timed edit to a major Lua module. [19:34:39] (03CR) 10Dzahn: [C: 03+2] icinga/base/prometheus: add notes_url to DISK space checks [puppet] - 10https://gerrit.wikimedia.org/r/504658 (https://phabricator.wikimedia.org/T220326) (owner: 10Dzahn) [19:36:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10Cmjohnson) MAC F0:92:1C:05:4A:98 [19:38:21] (03PS9) 10Alex Monk: acme-chief: Add script for Designate integration [puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) [19:38:41] https://logstash.wikimedia.org/goto/eb7219cf48b7e0b83e191a254696c7dd [19:40:57] (03CR) 10CDanis: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/504658 (https://phabricator.wikimedia.org/T220326) (owner: 10Dzahn) [19:44:31] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventBus, and 5 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (10Ottomata) ganeti1001, 1002 and 2001 have been installed. I dunno what's up with ganeti2002. `gnt-instance console schema2002.cod... [19:46:27] PROBLEM - puppet last run on alcyone is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:51:30] RECOVERY - puppet last run on alcyone is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:52:04] (03PS4) 10Andrew Bogott: cloud VMS: move primary resolvers to cloud-recursor0/1 [puppet] - 10https://gerrit.wikimedia.org/r/504580 (https://phabricator.wikimedia.org/T221183) [19:52:06] (03PS5) 10Andrew Bogott: cloud dns: move primary services to cloud-ns0 and cloud-ns1 [puppet] - 10https://gerrit.wikimedia.org/r/504572 (https://phabricator.wikimedia.org/T221183) [19:52:08] (03PS1) 10Andrew Bogott: cloudvirts: update more MAC addresses for 10gb move [puppet] - 10https://gerrit.wikimedia.org/r/504661 [19:52:10] (03PS1) 10Andrew Bogott: site.pp: make cloudvirt1001-1007 virt nodes [puppet] - 10https://gerrit.wikimedia.org/r/504662 [19:56:29] moritzm, what's the status of https://wikitech.wikimedia.org/wiki/Operating_system_upgrade_policy#Policy_proposal becoming an approved policy? [19:59:09] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirts: update more MAC addresses for 10gb move [puppet] - 10https://gerrit.wikimedia.org/r/504661 (owner: 10Andrew Bogott) [19:59:20] (03CR) 10Andrew Bogott: [C: 03+2] site.pp: make cloudvirt1001-1007 virt nodes [puppet] - 10https://gerrit.wikimedia.org/r/504662 (owner: 10Andrew Bogott) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. 
Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190417T2000). [20:00:45] no parsoid deploy today [20:04:51] 10Operations, 10User-fgiunchedi: jessie rsyslog upgrade problems - https://phabricator.wikimedia.org/T219764 (10Krenair) [20:07:23] so after some time to recover, https://grafana.wikimedia.org/d/000000438/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen it looks like the error rate increased but I can't point to any specific code change [20:10:36] yeah error rate definitely increased [20:11:33] ACKNOWLEDGEMENT - Host labvirt1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T221284 [20:11:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Cmjohnson) [20:12:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Cmjohnson) raid updated [20:12:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1003 with 10G interfaces - https://phabricator.wikimedia.org/T221139 (10Cmjohnson) [20:12:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1003 with 10G interfaces - https://phabricator.wikimedia.org/T221139 (10Cmjohnson) raid updated [20:13:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1004 with 10G interfaces - https://phabricator.wikimedia.org/T221138 (10Cmjohnson) [20:13:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1004 with 10G interfaces - https://phabricator.wikimedia.org/T221138 (10Cmjohnson) raid updated [20:15:54] 10Operations, 10monitoring, 10Patch-For-Review: prometheus1004 /srv/prometheus/ops almost full - https://phabricator.wikimedia.org/T220326 
(10Dzahn) The specific check this was about, disk on prometheus 1004, now has the Icinga link: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=prometheus1004... [20:17:25] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (10Dzahn) [20:17:38] 10Operations, 10Puppet, 10Traffic, 10Patch-For-Review: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (10Peachey88) [20:17:54] 10Puppet, 10Beta-Cluster-Infrastructure, 10Discovery-Search, 10Beta-Cluster-reproducible, 10Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (10Krenair) 05Open→03Resolved I'm going to go ahead and assume tools-elastic*... [20:17:56] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (10Dzahn) 05Open→03Resolved [20:21:21] 10Operations, 10ops-eqiad, 10cloud-services-team: labvirt1006.mgmt DOWN - https://phabricator.wikimedia.org/T221284 (10Dzahn) [20:28:46] 10Operations, 10ops-eqiad, 10cloud-services-team: labvirt1006.mgmt DOWN - https://phabricator.wikimedia.org/T221284 (10Dzahn) 05Open→03Invalid it's being renamed to cloudvirt currently [20:29:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10Cmjohnson) [20:31:31] 10Operations, 10monitoring, 10Patch-For-Review: EDAC events not being reported by node-exporter? - https://phabricator.wikimedia.org/T214529 (10CDanis) 05Open→03Resolved Calling this resolved for now -- the mtail-based events are also being monitored by Icinga, and would have caught all the previous inst... 
[20:34:22] 10Operations, 10DNS, 10Mail, 10Traffic: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (10Krenair) [20:37:37] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220907 (10Cmjohnson) @fgiunchedi do you want to power off unplug and power on...that will clear the issue [20:42:29] 10Operations, 10DNS, 10Mail, 10Traffic: wiki-mail DKIM failing - https://phabricator.wikimedia.org/T221290 (10Krenair) [20:43:04] (03PS1) 10Thcipriani: Gerrit 2.15.12 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/504774 [20:43:30] 10Operations, 10DNS, 10Mail, 10Traffic: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (10Krenair) {T216714} may be related here [20:47:34] (03PS1) 10Ladsgroup: admin: Remove my access from ores [puppet] - 10https://gerrit.wikimedia.org/r/504776 [20:47:50] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10bd808) [20:48:52] RECOVERY - Host labvirt1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [20:49:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1006 with 10G interfaces - https://phabricator.wikimedia.org/T221048 (10Cmjohnson) MAC: F0:92:1C:05:2A:60 [20:51:05] (03PS1) 10Thcipriani: gerrit: enable jgit gc [puppet] - 10https://gerrit.wikimedia.org/r/504778 [20:51:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1006 with 10G interfaces - https://phabricator.wikimedia.org/T221048 (10Cmjohnson) [20:52:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1006 with 10G interfaces - https://phabricator.wikimedia.org/T221048 (10Cmjohnson) @Andrew All Yours! 
[20:53:10] (03CR) 10Dzahn: [C: 04-1] "there's a spelling issue. initiatives vs. iniciatives" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) (owner: 10Urbanecm) [20:53:11] Krenair: we'll probably formally put in the next months, but I wouldn't expect further changes per se [20:54:00] moritzm, ack, thanks [20:56:48] (03PS1) 10Dzahn: icinga: remove Madhuvishy from command privileges [puppet] - 10https://gerrit.wikimedia.org/r/504779 [20:57:06] (03CR) 10Dzahn: [C: 03+2] gerrit: enable jgit gc [puppet] - 10https://gerrit.wikimedia.org/r/504778 (owner: 10Thcipriani) [20:57:47] hmm [20:57:53] that needs gerrit 2.15.12 [20:58:21] paladox: thcipriani said he will re-upgrade to gerrit 2.15.12 [20:58:28] ah ok [20:58:32] ignore me then :) [20:58:43] thcipriani: ^ and merged on master [20:58:56] mutante: great :) [20:59:07] (03CR) 10Paladox: [C: 03+2] Gerrit 2.15.12 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/504774 (owner: 10Thcipriani) [20:59:23] (03CR) 10Thcipriani: [V: 03+2] Gerrit 2.15.12 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/504774 (owner: 10Thcipriani) [21:00:08] (03PS1) 10Alex Monk: site: Only include ::nginx if ensure present [puppet/nginx] - 10https://gerrit.wikimedia.org/r/504781 (https://phabricator.wikimedia.org/T216164) [21:00:29] (03PS5) 10Andrew Bogott: cloud VMS: move primary resolvers to cloud-recursor0/1 [puppet] - 10https://gerrit.wikimedia.org/r/504580 (https://phabricator.wikimedia.org/T221183) [21:00:32] (03PS6) 10Andrew Bogott: cloud dns: move primary services to cloud-ns0 and cloud-ns1 [puppet] - 10https://gerrit.wikimedia.org/r/504572 (https://phabricator.wikimedia.org/T221183) [21:00:33] (03PS1) 10Andrew Bogott: Update mac address for cloudvirt1006 [puppet] - 10https://gerrit.wikimedia.org/r/504782 [21:00:35] (03CR) 10Dzahn: [C: 04-1] "there is different spelling, initiatives vs. 
iniciatives" [dns] - 10https://gerrit.wikimedia.org/r/504503 (https://phabricator.wikimedia.org/T167375) (owner: 10Urbanecm) [21:02:15] (03CR) 10Andrew Bogott: [C: 03+2] Update mac address for cloudvirt1006 [puppet] - 10https://gerrit.wikimedia.org/r/504782 (owner: 10Andrew Bogott) [21:03:59] (03CR) 10Andrew Bogott: [C: 03+1] icinga: remove Madhuvishy from command privileges [puppet] - 10https://gerrit.wikimedia.org/r/504779 (owner: 10Dzahn) [21:04:35] (03CR) 10Dzahn: [C: 03+2] icinga: remove Madhuvishy from command privileges [puppet] - 10https://gerrit.wikimedia.org/r/504779 (owner: 10Dzahn) [21:04:51] (03PS2) 10Dzahn: icinga: remove Madhuvishy from command privileges [puppet] - 10https://gerrit.wikimedia.org/r/504779 [21:06:27] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@4dcb851]: Gerrit update (gerrit2001 only) [21:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:38] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@4dcb851]: Gerrit update (gerrit2001 only) (duration: 00m 11s) [21:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:02] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@4dcb851]: Gerrit update (cobalt -- restart incoming) [21:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:13] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@4dcb851]: Gerrit update (cobalt -- restart incoming) (duration: 00m 10s) [21:07:14] (03CR) 10Dzahn: [C: 03+2] icinga: let BryanDavis issue commands on all hosts and services [puppet] - 10https://gerrit.wikimedia.org/r/504482 (https://phabricator.wikimedia.org/T220887) (owner: 10Dzahn) [21:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:14] !log gerrit back [21:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:53] (03PS1) 10Ottomata: Puppetize schema.wikimedia.org and refactor eventschemas module 
[puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) [21:12:19] (03PS2) 10Ottomata: Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) [21:12:55] (03PS3) 10Urbanecm: Prepare initial configuration for iniciativeswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) [21:13:23] (03PS3) 10Urbanecm: Add DNS entries for initiativeswiki [dns] - 10https://gerrit.wikimedia.org/r/504503 (https://phabricator.wikimedia.org/T167375) [21:13:33] (03PS4) 10Urbanecm: Prepare initial configuration for initiativeswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) [21:13:36] PROBLEM - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [21:13:45] (03CR) 10Urbanecm: "> Patch Set 2: Code-Review-1" [dns] - 10https://gerrit.wikimedia.org/r/504503 (https://phabricator.wikimedia.org/T167375) (owner: 10Urbanecm) [21:13:48] PROBLEM - Mediawiki Cirrussearch update lag - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [21:13:54] (03PS3) 10Ottomata: Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) [21:13:58] (03CR) 10Urbanecm: "> Patch Set 2: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) (owner: 10Urbanecm) [21:14:54] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [21:14:57] (03PS2) 10Dzahn: icinga: let BryanDavis issue commands on all hosts and services [puppet] - 10https://gerrit.wikimedia.org/r/504482 (https://phabricator.wikimedia.org/T220887) [21:15:36] PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [21:15:38] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports] [21:15:41] (03CR) 10jerkins-bot: [V: 04-1] Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [21:15:45] (03PS4) 10Ottomata: Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) [21:15:47] (03CR) 10Dzahn: [C: 03+2] icinga: let BryanDavis issue commands on all hosts and services [puppet] - 10https://gerrit.wikimedia.org/r/504482 (https://phabricator.wikimedia.org/T220887) (owner: 10Dzahn) [21:16:10] (03CR) 10Ottomata: "To be bikeshed:" [puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [21:16:45] (03CR) 10jerkins-bot: [V: 04-1] Puppetize schema.wikimedia.org and refactor eventschemas module [puppet] - 10https://gerrit.wikimedia.org/r/504787 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [21:17:06] 10Operations, 10SRE-Access-Requests, 10monitoring, 10Patch-For-Review: Allow Bryan Davis to downtime alerts in Icinga - https://phabricator.wikimedia.org/T220887 (10Dzahn) @bd808 Since the merge above you should now be able to schedule any host for downtime, be it wmcs or not. 
[21:17:27] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas] [21:17:57] 10Operations, 10SRE-Access-Requests, 10monitoring, 10Patch-For-Review: Allow Bryan Davis to downtime alerts in Icinga - https://phabricator.wikimedia.org/T220887 (10Dzahn) a:05Dzahn→03bd808 Wanna try with something random to schedule a short downtime? [21:27:15] (03PS1) 10Dzahn: site/mw: assign spare mw1297,mw1298 as API servers [puppet] - 10https://gerrit.wikimedia.org/r/504791 (https://phabricator.wikimedia.org/T192457) [21:30:16] (03PS2) 10Dzahn: site/mw/conftool: assign spare mw1297,mw1298 as API servers [puppet] - 10https://gerrit.wikimedia.org/r/504791 (https://phabricator.wikimedia.org/T192457) [21:30:28] !log enable option-82 on asw2-b:cloud-hosts1-b-eqiad vlan [21:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:46] (03PS1) 10Dzahn: conftool: add mw2151 as a jobrunner [puppet] - 10https://gerrit.wikimedia.org/r/504793 (https://phabricator.wikimedia.org/T192457) [21:34:23] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) >>! In T192457#5093559, @MoritzMuehlenhoff wrote: > During the HHVM updates I noticed that mw2151 is in site.pp as a jobrunner, but not listed in conftool-data. https://gerrit.wikimedia.org/... 
[21:35:05] RECOVERY - Mediawiki Cirrussearch update lag - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [21:36:00] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [21:36:54] 10Operations, 10Beta-Cluster-Infrastructure, 10Mathoid, 10Core Platform Team Backlog (Watching / External), and 2 others: remove mathoid from scb - https://phabricator.wikimedia.org/T200832 (10Krenair) >>! In T200832#5061718, @akosiaris wrote: >>>! In T200832#5051312, @Krenair wrote: >> deployment-mathoid... [21:37:06] RECOVERY - Mediawiki Cirrussearch update lag - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [21:40:30] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:41:22] RECOVERY - puppet last run on webperf1002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [21:41:24] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:43:10] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:43:20] (03PS1) 10Dzahn: site/mw/conftool: assign mw2150 as jobrunner, mw22244 as API server [puppet] - 10https://gerrit.wikimedia.org/r/504794 (https://phabricator.wikimedia.org/T192457) [21:46:55] (03PS2) 10Dzahn: site/mw/conftool: assign mw2150 as jobrunner, mw2244 as API server [puppet] - 10https://gerrit.wikimedia.org/r/504794 (https://phabricator.wikimedia.org/T192457) [21:48:55] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10Dzahn) a:03faidon [21:57:56] (03CR) 10Dzahn: [C: 03+1] Add DNS entries 
for initiativeswiki [dns] - 10https://gerrit.wikimedia.org/r/504503 (https://phabricator.wikimedia.org/T167375) (owner: 10Urbanecm) [22:00:49] 10Operations, 10hardware-requests, 10serviceops: requesting WMF7426 as phabricator system in eqiad - https://phabricator.wikimedia.org/T215335 (10Dzahn) [22:00:54] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019 (10Dzahn) [22:02:25] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (10Dzahn) [22:02:30] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10Dzahn) 05Open→03Stalled blocked on T215335 [22:05:15] 10Operations, 10hardware-requests, 10serviceops: requesting WMF7426 as phabricator system in eqiad - https://phabricator.wikimedia.org/T215335 (10Dzahn) @Robh Now that you are back.. can i please have this server assigned to me? That would unblock T190568 which has been waiting for quite a bit. It has alrea... 
[22:11:19] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) a:03RobH just assigning for the question in the 2 comments above [22:16:46] * Krinkle staging on mwdebug1002 [22:17:05] (03PS6) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527) [22:21:42] 10Operations, 10Traffic, 10Goal, 10HTTPS, 10Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (10Dzahn) re: the next checkbox above " Prioritize which "junk" domains should be in the primary (works for non-SNI) S... [22:26:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Andrew) [22:26:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1002 with 10G interfaces - https://phabricator.wikimedia.org/T221140 (10Andrew) [22:28:00] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/Score/: Id58156cfca805 / T219342 (duration: 01m 03s) [22:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:06] T219342: Pageview performance timeline analysis (March 2019) - https://phabricator.wikimedia.org/T219342 [22:31:04] (03PS1) 10Chico Venancio: InitialiseSettings.php: add year namespaces for WikimaniaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504798 (https://phabricator.wikimedia.org/T221297) [22:32:09] (03CR) 10jerkins-bot: [V: 04-1] InitialiseSettings.php: add year namespaces for WikimaniaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504798 (https://phabricator.wikimedia.org/T221297) (owner: 10Chico Venancio) [22:34:27] 10Operations, 10Thumbor, 
10hardware-requests: reallocate former image scaler to thumbor use - https://phabricator.wikimedia.org/T218323 (10Dzahn) > mw2245 reallocated -> reserved for thumbor (T218323) marked mw2245 as reserved for thumbor in site.pp, as i am working on that ticket to re-assign imagescal... [22:37:01] (03PS8) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [22:37:09] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10Dzahn) " Field tech isolated the fault location and is en route to perform a survey of the damage. " Wed, 17 Apr 2019 23:04 : Field tech is still working at the site to run OTDR and isolate the lo... [22:40:09] !log push firewall change to pfw3-codfw - T221278 [22:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:59] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.1/includes/: I3a50508178159 (duration: 01m 21s) [22:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:30] (03CR) 10Jforrester: [C: 04-1] "The commit message title does not match the rest of the commit contents." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504798 (https://phabricator.wikimedia.org/T221297) (owner: 10Chico Venancio) [22:57:57] (03CR) 10Alex Monk: [C: 03+2] config: Move ACMEChiefConfig to its own module [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504510 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez) [22:59:30] (03Merged) 10jenkins-bot: config: Move ACMEChiefConfig to its own module [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504510 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez) [23:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I ♥ Unicode.
All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190417T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:14] (03PS1) 10Dzahn: admins: add missing email field for user pbj [puppet] - 10https://gerrit.wikimedia.org/r/504812 [23:01:02] (03CR) 10jenkins-bot: config: Move ACMEChiefConfig to its own module [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504510 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez) [23:09:59] PROBLEM - puppet last run on aqs1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:15:34] (03CR) 10Krinkle: [C: 04-1] "Per previous patch set." [puppet] - 10https://gerrit.wikimedia.org/r/502986 (https://phabricator.wikimedia.org/T211488) (owner: 10Giuseppe Lavagetto) [23:16:25] (03CR) 10Dzahn: [C: 03+2] admins: add missing email field for user pbj [puppet] - 10https://gerrit.wikimedia.org/r/504812 (owner: 10Dzahn) [23:18:49] (03CR) 10Alex Monk: [C: 04-1] dns: Move DNS operations to its own module (033 comments) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504511 (owner: 10Vgutierrez) [23:32:38] (03PS1) 10Jforrester: wikitech: Re-disable UserMerge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504814 [23:41:03] RECOVERY - puppet last run on aqs1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:42:23] (03CR) 10Alex Monk: "Are we potentially going to run into issues with it trying to reissue certs that have skipped SNIs because it thinks there must've been a " (035 comments) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/504512 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez) [23:55:44] (03PS1) 10BryanDavis: ldap: Add support for sudo rules in sssd client config [puppet] - 10https://gerrit.wikimedia.org/r/504817 (https://phabricator.wikimedia.org/T221225)