[00:13:05] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [00:27:06] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80% [00:28:29] Hmm those warnings keep going once in a while. [00:29:10] paladox: I'm going to downtime the host [00:29:36] the alert is set for 80% usage, and the usage oscillate between 80% [00:29:47] XioNoX: thanks [00:30:37] downtimed for 24h [00:31:39] Thanks [00:43:06] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [00:53:04] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% [00:53:38] hum [00:54:26] forgot to add the 2nd side of the link [02:39:14] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.25) (duration: 10m 54s) [02:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:59] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 741.06 seconds [03:57:00] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 253.21 seconds [04:10:59] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100% [04:12:59] RECOVERY - Host cp3048 is UP: PING WARNING - Packet loss = 64%, RTA = 84.59 ms [06:06:41] (03PS1) 10Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420274 (https://phabricator.wikimedia.org/T187089) [06:10:35] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420274 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:11:41] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420274 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:11:43] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/420274 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:13:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1091 for schema change, kernel upgrade and mariadb upgrade (duration: 00m 58s) [06:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:40] !log Stop MySQL on db1091 for kernel and mariadb upgrade [06:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:40] (03PS4) 10Marostegui: dbproxy100[2,7]: Change standby host [puppet] - 10https://gerrit.wikimedia.org/r/420061 (https://phabricator.wikimedia.org/T189773) [06:20:38] !log Deploy schema change on db1091 - T187089 T185128 T153182 [06:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:45] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [06:20:45] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [06:20:45] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [06:28:16] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/apt/keys/ubuntucloud.gpg] [06:29:15] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean] [06:29:35] PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:46:29] <_joe_> . 
[06:56:22] 10Operations, 10Ops-Access-Requests: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4060161 (10Samwilson) [06:57:39] 10Operations, 10Ops-Access-Requests: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4041118 (10Samwilson) a:05Samwilson>03None @robh thanks! I've added my new key, and signed L3. [06:58:16] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:59:06] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [06:59:35] RECOVERY - puppet last run on db1092 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:24:24] (03PS1) 10Elukey: geowiki::job::monitoring: disable old/not-useful cron [puppet] - 10https://gerrit.wikimedia.org/r/420275 (https://phabricator.wikimedia.org/T173486) [07:24:26] (03CR) 10Elukey: [C: 032] geowiki::job::monitoring: disable old/not-useful cron [puppet] - 10https://gerrit.wikimedia.org/r/420275 (https://phabricator.wikimedia.org/T173486) (owner: 10Elukey) [07:25:15] (03PS3) 10Elukey: aptrepo: add cassandra22 component [puppet] - 10https://gerrit.wikimedia.org/r/420059 [07:25:48] (03CR) 10Marostegui: [C: 032] dbproxy100[2,7]: Change standby host [puppet] - 10https://gerrit.wikimedia.org/r/420061 (https://phabricator.wikimedia.org/T189773) (owner: 10Marostegui) [07:27:10] (03PS5) 10Marostegui: dbproxy100[2,7]: Change standby host [puppet] - 10https://gerrit.wikimedia.org/r/420061 (https://phabricator.wikimedia.org/T189773) [07:27:35] !log Reload dbproxy1002 and dbproxy1007 to get the new config - T189773 [07:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:44] T189773: Decommission db1020 - https://phabricator.wikimedia.org/T189773 [07:30:34] (03PS4) 10Elukey: aptrepo: add cassandra22 component [puppet] - 10https://gerrit.wikimedia.org/r/420059 
[07:34:40] (03CR) 10Elukey: [C: 032] aptrepo: add cassandra22 component [puppet] - 10https://gerrit.wikimedia.org/r/420059 (owner: 10Elukey) [07:35:31] (03PS1) 10Marostegui: db-eqiad.php: Move db1106 from s5 to s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420277 (https://phabricator.wikimedia.org/T183469) [07:36:37] (03PS2) 10Marostegui: db-eqiad.php: Move db1106 from s5 to s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420277 (https://phabricator.wikimedia.org/T183469) [07:41:35] (03PS1) 10Marostegui: mariadb: Move db1106 from s5 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/420278 (https://phabricator.wikimedia.org/T183469) [07:41:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Move db1106 from s5 to s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420277 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [07:42:16] 10Operations, 10Cassandra, 10Services (doing), 10User-Eevans, 10User-Elukey: Test/upload new cassandra 2.2.6 package (wmf3) - https://phabricator.wikimedia.org/T189529#4060204 (10elukey) Creating the cassandra22 component in apt with https://gerrit.wikimedia.org/r/#/c/420059/ [07:42:42] (03Merged) 10jenkins-bot: db-eqiad.php: Move db1106 from s5 to s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420277 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [07:43:16] (03CR) 10jenkins-bot: db-eqiad.php: Move db1106 from s5 to s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420277 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [07:44:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Move db1106 from s5 to s1 - T183469 (duration: 01m 00s) [07:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:16] T183469: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469 [07:46:34] (03PS1) 10Elukey: Revert "Revert "role::aqs: enable Cassandra JMX exporter"" 
[puppet] - 10https://gerrit.wikimedia.org/r/420279 [07:46:50] (03PS2) 10Elukey: Revert "Revert "role::aqs: enable Cassandra JMX exporter"" [puppet] - 10https://gerrit.wikimedia.org/r/420279 [07:47:32] !log drain cassandra instances and reboot aqs1004 for kernel upgrades [07:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:56] (03CR) 10Elukey: [C: 032] Revert "Revert "role::aqs: enable Cassandra JMX exporter"" [puppet] - 10https://gerrit.wikimedia.org/r/420279 (owner: 10Elukey) [07:50:36] (03PS1) 10Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420281 (https://phabricator.wikimedia.org/T183469) [07:50:43] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/10501/" [puppet] - 10https://gerrit.wikimedia.org/r/420278 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [07:50:45] (03PS2) 10Marostegui: mariadb: Move db1106 from s5 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/420278 (https://phabricator.wikimedia.org/T183469) [07:54:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420281 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [07:54:57] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420281 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [07:54:58] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420281 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [07:55:43] (03CR) 10Marostegui: [C: 032] mariadb: Move db1106 from s5 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/420278 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [07:55:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1065 - T183469 (duration: 00m 57s) [07:55:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:53] T183469: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469 [07:58:12] (03PS1) 10Marostegui: s1,s5.hosts: Move db1106 to s1 [software] - 10https://gerrit.wikimedia.org/r/420282 (https://phabricator.wikimedia.org/T183469) [07:58:21] !log manually installed cassandra-2.2.6-wmf3 on aqs1004 - T189529 [07:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:26] T189529: Test/upload new cassandra 2.2.6 package (wmf3) - https://phabricator.wikimedia.org/T189529 [08:00:53] (03CR) 10Marostegui: [C: 032] s1,s5.hosts: Move db1106 to s1 [software] - 10https://gerrit.wikimedia.org/r/420282 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [08:00:57] (03Merged) 10jenkins-bot: s1,s5.hosts: Move db1106 to s1 [software] - 10https://gerrit.wikimedia.org/r/420282 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [08:07:22] (03PS1) 10Muehlenhoff: Remove access for siddharth11 [puppet] - 10https://gerrit.wikimedia.org/r/420283 [08:10:08] (03CR) 10Muehlenhoff: [C: 032] Remove access for siddharth11 [puppet] - 10https://gerrit.wikimedia.org/r/420283 (owner: 10Muehlenhoff) [08:11:34] !log Reboot db1106 for kernel upgrade [08:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:30] 10Operations, 10Cassandra, 10Services (doing), 10User-Eevans, 10User-Elukey: Test/upload new cassandra 2.2.6 package (wmf3) - https://phabricator.wikimedia.org/T189529#4060241 (10elukey) Tried to (manually via dpkg -i) install cassandra 2.2.6-wmf3 on aqs1004: ``` elukey@aqs1004:~$ dpkg -l | grep cassand... 
[08:19:24] !log Reset slave on db1106 to get it ready for s1 - https://phabricator.wikimedia.org/T183469 [08:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:05] (03PS1) 10Elukey: Revert "role::aqs: enable Cassandra JMX exporter" [puppet] - 10https://gerrit.wikimedia.org/r/420284 [08:21:06] (03PS2) 10Elukey: Revert "role::aqs: enable Cassandra JMX exporter" [puppet] - 10https://gerrit.wikimedia.org/r/420284 [08:21:37] (03CR) 10Elukey: [C: 032] Revert "role::aqs: enable Cassandra JMX exporter" [puppet] - 10https://gerrit.wikimedia.org/r/420284 (owner: 10Elukey) [08:22:02] !log revert previous state on aqs1004, the new pkg might need some more work - T189529 [08:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:08] T189529: Test/upload new cassandra 2.2.6 package (wmf3) - https://phabricator.wikimedia.org/T189529 [08:26:07] !log installing libvorbis security updates [08:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:23] !log reboot thorium for kernel security upgrades (hosts all analytics websites, they will go down temporary) [08:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:18] !log installing openjdk-8 security updates [08:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:34] * elukey hides from moritzm [09:02:17] (03CR) 10Elukey: [C: 032] Allow the config of maximum tolerated failed volumes for the datanode [puppet/cdh] - 10https://gerrit.wikimedia.org/r/420031 (owner: 10Elukey) [09:04:15] (03PS1) 10Marostegui: db-eqiad.php: Depool es1016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420285 [09:04:17] (03PS4) 10Filippo Giunchedi: Depool codfw puppetmaster [dns] - 10https://gerrit.wikimedia.org/r/419774 (https://phabricator.wikimedia.org/T184562) [09:05:27] (03PS1) 10Marostegui: es1016: Update socket location [puppet] - 
10https://gerrit.wikimedia.org/r/420286 [09:05:29] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool es1016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420285 (owner: 10Marostegui) [09:07:18] (03Merged) 10jenkins-bot: db-eqiad.php: Depool es1016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420285 (owner: 10Marostegui) [09:07:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool es1016 for kernel, mariadb and socket location upgrade (duration: 00m 58s) [09:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:20] !log Stop MySQL on es1016 for kernel, mariadb and socket location upgrade [09:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:16] (03CR) 10Marostegui: [C: 032] es1016: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/420286 (owner: 10Marostegui) [09:09:17] (03CR) 10jenkins-bot: db-eqiad.php: Depool es1016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420285 (owner: 10Marostegui) [09:10:16] !log depool codfw puppetmaster - T184562 [09:10:22] (03CR) 10Filippo Giunchedi: [C: 032] Depool codfw puppetmaster [dns] - 10https://gerrit.wikimedia.org/r/419774 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [09:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:25] T184562: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562 [09:12:30] PROBLEM - HHVM rendering on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:12:53] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-apache-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419778 (https://phabricator.wikimedia.org/T135991) [09:13:20] RECOVERY - HHVM rendering on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 79896 bytes in 0.366 second response time [09:16:12] (03PS1) 10Elukey: profile::hadoop::common: force the datanode to tolerate two 
disk failures [puppet] - 10https://gerrit.wikimedia.org/r/420287 [09:16:30] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for prometheus-apache-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419778 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:19:33] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool es1016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420288 [09:20:56] (03PS1) 10Elukey: Fix hdfs-site template variable [puppet/cdh] - 10https://gerrit.wikimedia.org/r/420289 [09:22:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool es1016 after kernel, mariadb and socket location upgrade (duration: 00m 58s) [09:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:36] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool es1016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420288 (owner: 10Marostegui) [09:22:38] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool es1016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420288 (owner: 10Marostegui) [09:22:52] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool es1016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420288 (owner: 10Marostegui) [09:22:55] (03CR) 10Elukey: [V: 032 C: 032] Fix hdfs-site template variable [puppet/cdh] - 10https://gerrit.wikimedia.org/r/420289 (owner: 10Elukey) [09:23:26] (03PS2) 10Filippo Giunchedi: cache: depool puppetmaster2001 from config-master.w.o [puppet] - 10https://gerrit.wikimedia.org/r/419795 (https://phabricator.wikimedia.org/T184562) [09:23:28] (03PS2) 10Elukey: profile::hadoop::common: force the datanode to tolerate two disk failures [puppet] - 10https://gerrit.wikimedia.org/r/420287 [09:24:32] (03CR) 10Filippo Giunchedi: [C: 032] cache: depool puppetmaster2001 from config-master.w.o [puppet] - 10https://gerrit.wikimedia.org/r/419795 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [09:27:03] !log reimage 
puppetmaster2001 with stretch - T184562 [09:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:09] T184562: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562 [09:30:41] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/10503/" [puppet] - 10https://gerrit.wikimedia.org/r/420287 (owner: 10Elukey) [09:31:30] (03PS3) 10Elukey: profile::hadoop::common: force the datanode to tolerate two disk failures [puppet] - 10https://gerrit.wikimedia.org/r/420287 [09:31:32] (03CR) 10Elukey: [C: 032] profile::hadoop::common: force the datanode to tolerate two disk failures [puppet] - 10https://gerrit.wikimedia.org/r/420287 (owner: 10Elukey) [09:37:10] !log restart hadoop daemons on analytics1070 for openjdk upgrades (canary) [09:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:30] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 20 probes of 303 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:42:41] (03PS4) 10Jdrewniak: Replace portals submodule with portals/deploy submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393239 (https://phabricator.wikimedia.org/T180777) [09:44:24] 10Operations, 10DBA, 10hardware-requests: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4060424 (10Marostegui) Backup a binary copy of db1009: `es2001:/srv/backups/older/m5/db1009_binary_copy/db1009.tar.gz` Backup a logical copy of testreduce_0715.results table: `es2001:/srv/backups/older/m... 
[09:45:48] !log Stop MySQL on db1009 - T189216 [09:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:54] T189216: Decommission db1009 - https://phabricator.wikimedia.org/T189216 [09:46:30] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 13 probes of 303 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:48:49] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db1009 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420293 (https://phabricator.wikimedia.org/T189216) [09:49:38] (03CR) 10Elukey: [C: 031] "LGTM! The only thing that I don't love is that all the hosts are in B6, so under maintenance we loose 4 nodes at once (not a big deal thou" [puppet] - 10https://gerrit.wikimedia.org/r/420011 (owner: 10Muehlenhoff) [09:50:09] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Remove db1009 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420293 (https://phabricator.wikimedia.org/T189216) (owner: 10Marostegui) [09:52:55] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db1009 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420293 (https://phabricator.wikimedia.org/T189216) (owner: 10Marostegui) [09:52:57] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db1009 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420293 (https://phabricator.wikimedia.org/T189216) (owner: 10Marostegui) [09:54:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1009 from config - T189216 (duration: 00m 58s) [09:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:37] T189216: Decommission db1009 - https://phabricator.wikimedia.org/T189216 [09:55:38] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1009 from config - T189216 (duration: 00m 57s) [09:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:01] (03PS2) 10Ema: 
varnish: move gethdr_extrachance to runtime_params [puppet] - 10https://gerrit.wikimedia.org/r/419705 (https://phabricator.wikimedia.org/T174932) [09:56:37] 10Operations, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4060488 (10Marostegui) [10:00:16] PROBLEM - puppet last run on db2085 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:01:49] (03PS3) 10Ema: varnish: move gethdr_extrachance to runtime_params [puppet] - 10https://gerrit.wikimedia.org/r/419705 (https://phabricator.wikimedia.org/T174932) [10:05:06] 10Operations, 10Analytics, 10Patch-For-Review: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#4060495 (10elukey) [10:05:08] (03CR) 10Vgutierrez: [C: 031] varnish: move gethdr_extrachance to runtime_params [puppet] - 10https://gerrit.wikimedia.org/r/419705 (https://phabricator.wikimedia.org/T174932) (owner: 10Ema) [10:09:16] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-spa-ita: Fix dependency [debs/contenttranslation/apertium-spa-ita] - 10https://gerrit.wikimedia.org/r/420020 (owner: 10KartikMistry) [10:11:14] (03CR) 10Ema: [C: 032] varnish: move gethdr_extrachance to runtime_params [puppet] - 10https://gerrit.wikimedia.org/r/419705 (https://phabricator.wikimedia.org/T174932) (owner: 10Ema) [10:14:23] !log cp3008: upgrade to varnish 5.1.3-1wm4 [10:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:10] (03CR) 10Zfilipin: "removed myself from reviewers, I am not familiar with this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393239 (https://phabricator.wikimedia.org/T180777) (owner: 10Jdrewniak) [10:22:51] (03PS1) 10Marostegui: mariadb: Set db1009 to spare [puppet] - 10https://gerrit.wikimedia.org/r/420295 (https://phabricator.wikimedia.org/T189216) [10:22:55] (03PS1) 10Marostegui: m5.hosts: Remove db1009 [software] - 10https://gerrit.wikimedia.org/r/420296 
(https://phabricator.wikimedia.org/T189216) [10:23:25] (03PS2) 10Marostegui: mariadb: Set db1009 to spare [puppet] - 10https://gerrit.wikimedia.org/r/420295 (https://phabricator.wikimedia.org/T189216) [10:25:09] !log Remove db1009 from tendril - T189216 [10:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:15] T189216: Decommission db1009 - https://phabricator.wikimedia.org/T189216 [10:25:37] (03CR) 10Marostegui: [C: 032] m5.hosts: Remove db1009 [software] - 10https://gerrit.wikimedia.org/r/420296 (https://phabricator.wikimedia.org/T189216) (owner: 10Marostegui) [10:27:28] 10Operations, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4060598 (10Marostegui) [10:27:32] (03Merged) 10jenkins-bot: m5.hosts: Remove db1009 [software] - 10https://gerrit.wikimedia.org/r/420296 (https://phabricator.wikimedia.org/T189216) (owner: 10Marostegui) [10:27:34] (03CR) 10Marostegui: [C: 032] mariadb: Set db1009 to spare [puppet] - 10https://gerrit.wikimedia.org/r/420295 (https://phabricator.wikimedia.org/T189216) (owner: 10Marostegui) [10:27:49] (03CR) 10Marostegui: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10505/" [puppet] - 10https://gerrit.wikimedia.org/r/420295 (https://phabricator.wikimedia.org/T189216) (owner: 10Marostegui) [10:27:51] 10Operations, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4035423 (10Marostegui) @chasemp can you please proceed and remove the ACL for db1009 now? 
[10:28:29] 10Operations, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4060600 (10Marostegui) [10:29:20] 10Operations, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4035423 (10Marostegui) a:05jcrespo>03RobH This host is now ready for DC Ops decommissioning, so assigning it to @RobH [10:29:29] (03PS1) 10Giuseppe Lavagetto: hhvm::admin: fix location of references [puppet] - 10https://gerrit.wikimedia.org/r/420297 [10:30:19] RECOVERY - puppet last run on db2085 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:33:38] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420298 [10:33:40] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420298 [10:34:29] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420298 (owner: 10Marostegui) [10:36:40] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420298 (owner: 10Marostegui) [10:37:23] <_joe_> kart_: ping [10:37:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 - T183469 (duration: 00m 58s) [10:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:36] T183469: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469 [10:37:42] <_joe_> kart_: why does your service use the labs recommendation api instead of the production service? [10:37:46] (03PS1) 10Vgutierrez: Make log look tidier on pybal start-up. 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/420299 (https://phabricator.wikimedia.org/T189290) [10:39:44] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420298 (owner: 10Marostegui) [10:40:55] _joe_: Does it deployed in Production? [10:41:13] (We shouldn't depends on Labs service though) [10:42:20] <_joe_> ... [10:42:21] <_joe_> yes [10:42:24] <_joe_> since months [10:42:38] <_joe_> I assumed you and research synced on that [10:42:46] (03PS2) 10Filippo Giunchedi: Use codfw puppetmasters in ulsfo [dns] - 10https://gerrit.wikimedia.org/r/420003 [10:42:49] <_joe_> kart_: where is the config that points to the labs service? [10:43:13] (03CR) 10Filippo Giunchedi: [C: 032] Use codfw puppetmasters in ulsfo [dns] - 10https://gerrit.wikimedia.org/r/420003 (owner: 10Filippo Giunchedi) [10:43:15] (03PS2) 10Vgutierrez: Make log look tidier on pybal start-up [debs/pybal] - 10https://gerrit.wikimedia.org/r/420299 (https://phabricator.wikimedia.org/T189290) [10:44:18] _joe_: one sec. Giving link. [10:46:10] PROBLEM - Disk space on kubernetes2004 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/6a64a8ff-2b62-11e8-9d96-aa000081eedf/volumes/kubernetes.iosecret/tiller-token-skwhs is not accessible: Permission denied [10:46:50] (03CR) 10Volans: "Nitpick 2 cents inline ;)" (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/420299 (https://phabricator.wikimedia.org/T189290) (owner: 10Vgutierrez) [10:47:08] _joe_: https://phabricator.wikimedia.org/diffusion/ECTX/browse/master/extension.json;7bdecdab2f15218e727b28fafe7b4dea7b750874$151 - where we point it to URL. [10:47:08] !log point ulsfo puppet to puppetmaster2001 [10:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:21] which can be override using mw-config. [10:47:27] For production. [10:47:41] <_joe_> ok [10:47:59] <_joe_> what's the variable I have to override then? 
[10:48:17] $wgRecommendToolAPIURL [10:48:18] <_joe_> I will add the recommendation-api to ProductionServices, btw [10:48:26] <_joe_> ok I got that right then [10:48:29] <_joe_> Reedy: thanks [10:48:31] Might be careful where you add it, as they'll want it still there on labs [10:48:46] <_joe_> yeah I know, the whole thing is managed cleverly nowadays [10:48:56] Gonna do it via etcd? [10:49:14] <_joe_> somehow :P [10:49:16] _joe_: RecommendToolAPIURL [10:49:18] Easiest fix... [10:49:23] if ( $wmgUseContentTranslation ) { in both [10:49:37] Just set it manually in CommonSettings.php and CommonSettings-labs.php in that if [10:49:42] <_joe_> yes [10:51:12] (PS1) Elukey: role::analytics_cluster::client: move config to a separate profile [puppet] - https://gerrit.wikimedia.org/r/420301 (https://phabricator.wikimedia.org/T167790) [10:51:42] thanks _joe_ Reedy kart_, I go back to my stuff. Ping me if you need something from me or my team [10:53:14] arturo: sure. [10:53:23] !log restarting jenkins on releases1001 to pick up Java security update [10:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:04] <_joe_> before doing the change, I was checking things would work [10:55:32] <_joe_> https://phabricator.wikimedia.org/diffusion/ECTX/browse/master/modules/dashboard/ext.cx.recommendtool.client.js;7bdecdab2f15218e727b28fafe7b4dea7b750874$76 this is a http GET request? [10:56:26] (PS2) Filippo Giunchedi: Use codfw puppetmasters in eqsin [dns] - https://gerrit.wikimedia.org/r/420004 [10:56:58] _joe_: the $.get?
yeah, that's a jquery http get [10:57:34] <_joe_> yeah, that doesn't work with the production service [10:57:37] <_joe_> just tested [10:57:50] <_joe_> so I have no idea wtf is going on here [10:58:05] (03CR) 10Filippo Giunchedi: [C: 032] Use codfw puppetmasters in eqsin [dns] - 10https://gerrit.wikimedia.org/r/420004 (owner: 10Filippo Giunchedi) [10:58:39] !log point eqsin puppet to puppetmaster2001 [10:58:41] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/10506/stat1005.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/420301 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [10:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:39] _joe_: looking and checking with team about url. [11:00:04] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180319T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:01:23] (03CR) 10Jdrewniak: [C: 032] Replace portals submodule with portals/deploy submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393239 (https://phabricator.wikimedia.org/T180777) (owner: 10Jdrewniak) [11:02:19] jan_drewniak: I am around :] [11:02:28] (03Merged) 10jenkins-bot: Replace portals submodule with portals/deploy submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393239 (https://phabricator.wikimedia.org/T180777) (owner: 10Jdrewniak) [11:02:45] (03CR) 10jenkins-bot: Replace portals submodule with portals/deploy submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393239 (https://phabricator.wikimedia.org/T180777) (owner: 10Jdrewniak) [11:02:47] (03PS3) 10Vgutierrez: Make log look tidier on pybal start-up [debs/pybal] - 10https://gerrit.wikimedia.org/r/420299 (https://phabricator.wikimedia.org/T189290) [11:03:18] hashar: thanks! 
[11:03:50] <_joe_> yeah it seems that the production service wants "/{domain}/v1/translation/articles/{source}{/seed}": [11:04:09] (CR) Vgutierrez: Make log look tidier on pybal start-up (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/420299 (https://phabricator.wikimedia.org/T189290) (owner: Vgutierrez) [11:04:54] _joe_: right. Things are changed there. [11:05:44] <_joe_> sigh [11:05:57] <_joe_> ok so we're back to square one [11:06:04] !log uploaded openjdk-8 8u162-b12-1~bpo8+1 for jessie-wikimedia to apt.wikimedia.org [11:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:34] _joe_: thanks for help! [11:06:49] (and discovering Production service) [11:07:32] (PS2) Muehlenhoff: Repurpose four image scalers as video scalers [puppet] - https://gerrit.wikimedia.org/r/420011 [11:08:52] (CR) Mark Bergsma: Make log look tidier on pybal start-up (2 comments) [debs/pybal] - https://gerrit.wikimedia.org/r/420299 (https://phabricator.wikimedia.org/T189290) (owner: Vgutierrez) [11:09:29] <_joe_> kart_: I won't troubleshoot the labs service now, I think it's up to you/research to sync and fix this [11:09:42] _joe_: yep. [11:09:51] (CR) Muehlenhoff: [C: 2] Repurpose four image scalers as video scalers [puppet] - https://gerrit.wikimedia.org/r/420011 (owner: Muehlenhoff) [11:09:54] _joe_: I'm filing bug.
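Editor's sketch, not part of the log: the path quoted by _joe_ at 11:03:50 looks like a URI template in which `{/seed}` is an optional trailing path segment. Assuming that reading, the snippet below shows how such a template could be expanded. The function name `expand` and the example values (`en.wikipedia.org`, `de`, `Berlin`) are hypothetical and chosen only for illustration; the real service's parameters may differ.

```python
import re
from urllib.parse import quote

# Path template as quoted in the log at 11:03:50.
TEMPLATE = "/{domain}/v1/translation/articles/{source}{/seed}"


def expand(template, **values):
    """Expand {name} and optional {/name} placeholders.

    This handles only the two placeholder forms that appear in TEMPLATE,
    a small subset of URI-template syntax (an assumption of this sketch).
    """
    def substitute(match):
        slash, name = match.group(1), match.group(2)
        value = values.get(name)
        if value is None:
            if slash:
                return ""  # optional path segment is dropped entirely
            raise KeyError(name)  # plain {name} placeholders are required
        # Percent-encode the value and keep the leading slash, if any.
        return slash + quote(str(value), safe="")

    return re.sub(r"\{(/?)(\w+)\}", substitute, template)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(expand(TEMPLATE, domain="en.wikipedia.org", source="de"))
    print(expand(TEMPLATE, domain="en.wikipedia.org", source="de", seed="Berlin"))
```

A client built against a different URL shape would not match this path, which fits the observation in the log that the existing GET request did not work against the production service.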
[11:12:15] RECOVERY - Disk space on kubernetes2004 is OK: DISK OK [11:12:26] (03PS1) 10Muehlenhoff: Reimage mw1293-mw1296 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/420302 [11:12:51] 10Operations, 10DBA, 10Goal: Generate consistent logical database backups in CODFW - https://phabricator.wikimedia.org/T184699#4060677 (10jcrespo) 05Open>03Resolved a:03jcrespo Done at T184696 and T183735 [11:13:26] !log jdrewniak@tin Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:393239|Switching portals submodule to portals-deploy (T180777)]] (duration: 00m 58s) [11:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:31] T180777: Move portal deployment artifacts into their own repo - https://phabricator.wikimedia.org/T180777 [11:13:44] (03PS1) 10Alexandros Kosiaris: network: Add pods IPv6 space [puppet] - 10https://gerrit.wikimedia.org/r/420303 [11:14:25] !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:393239|Switching portals submodule to portals-deploy (T180777)]] (duration: 00m 58s) [11:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:54] PROBLEM - mediawiki-installation DSH group on mw1295 is CRITICAL: Host mw1295 is not in mediawiki-installation dsh group [11:15:23] hashar: ... 
I did the sync [11:15:24] 10Operations, 10DBA, 10Goal: Generate consistent logical database backups in CODFW - https://phabricator.wikimedia.org/T184699#4060692 (10jcrespo) [11:15:43] hashar: but all did not go well [11:15:49] :D [11:15:58] ah a whole sync [11:16:26] (03CR) 10Alexandros Kosiaris: [C: 032] network: Add pods IPv6 space [puppet] - 10https://gerrit.wikimedia.org/r/420303 (owner: 10Alexandros Kosiaris) [11:16:28] <_joe_> kart_: tbh, I'm not sure anymore those are the same service, I just remember I was asked like 10 months ago to deploy recommendation-api for content-translation, and tbh I won't fix a service running in labs; those are the responsibility of individual teams [11:16:34] https://www.wikipedia.org/ 403 forbidden [11:16:36] jan_drewniak: gotta revert I guess [11:16:47] <_joe_> oh what? [11:16:55] hashar: yeah.. it worked on mwdebug... [11:17:12] !log cache_misc@esams: upgrade to varnish 5.1.3-1wm4 [11:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:17] <_joe_> jan_drewniak: you NEED to use apache-fast-test [11:17:44] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - kubelet_operational_latencies is 47917 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:17:44] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - kubelet_operational_latencies is 50148 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:17:56] jan_drewniak: or some files haven't been synced? [11:18:01] (03PS2) 10Muehlenhoff: Reimage mw1293-mw1296 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/420302 [11:18:08] _joe_: yes. I need to get ack from research team if both are same. 
[11:18:17] <_joe_> please do [11:18:34] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - kubelet_operational_latencies is 55193 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:18:44] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - kubelet_operational_latencies is 56103 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:18:54] hashar: scap ran successfully so I think files were synced [11:19:02] /srv/mediawiki/docroot/wwwportal/portal -> ../../portals/prod [11:19:05] that does not exist [11:19:35] hashar: that was in the patch! I guess that didn't sync? [11:19:39] <_joe_> hashar: uhm I get a 200 ok for www.wikipedia.org [11:19:42] hmm [11:20:06] 403 for me [11:20:13] <_joe_> but on the appserver directly [11:20:26] I'm also getting the "Forbidden. You don't have permission to access / on this server." [11:20:53] <_joe_> yeah I'm trying to understand what's wrong there [11:21:10] (03CR) 10Muehlenhoff: [C: 032] Reimage mw1293-mw1296 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/420302 (owner: 10Muehlenhoff) [11:21:28] hashar: shoot, ok, do I need to create a new patch to revert (sorry never had to revert before) [11:21:50] jan_drewniak: what I get is that /srv/mediawiki/docroot/wwwportal/portal got changed from '../../portals/prod' to '../../portals' but that ends up being a broken link (the target does not exist) [11:21:59] <_joe_> yeah [11:22:05] PROBLEM - mediawiki-installation DSH group on mw1296 is CRITICAL: Host mw1296 is not in mediawiki-installation dsh group [11:22:24] PROBLEM - Disk space on kubernetes2003 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/7564500f-2b67-11e8-9d96-aa000081eedf/volumes/kubernetes.iosecret/tiller-token-8f64j is not accessible: Permission denied [11:22:46] and if I go to /srv/mediawiki/portals there is no "prod" directory there [11:23:07] hmm [11:23:12] yeah because the patch drop it bah [11:23:26] 
<_joe_> yeah revert [11:23:30] <_joe_> and then we have to purge [11:23:34] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: OK - kubelet_operational_latencies is 4667 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:23:35] hashar: yeah in my patch I changed the symlink because the directory was not needed anymore https://gerrit.wikimedia.org/r/#/c/393239/5/docroot/wwwportal/portal [11:23:42] sorry! [11:23:45] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - kubelet_operational_latencies is 5023 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:23:45] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - kubelet_operational_latencies is 6686 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:23:52] <_joe_> let's revert first, ask questions later [11:24:04] (03PS1) 10Alexandros Kosiaris: Update mathoid chart to resemble current production [deployment-charts] - 10https://gerrit.wikimedia.org/r/420305 [11:24:11] <_joe_> interestingly [11:24:16] <_joe_> mwdebug1002 works [11:24:20] <_joe_> the others no [11:24:23] <_joe_> let me try something [11:24:28] I think that is a sync issue [11:24:34] <_joe_> yes [11:24:43] staging is fine [11:24:44] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - kubelet_operational_latencies is 4905 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:24:51] <_joe_> what is staging? [11:24:55] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[jobrunner/jobrunner] [11:25:28] ie: /srv/mediawiki-staging/docroot/wwwportal/portal points properly to ../../portals [11:25:32] <_joe_> confirmed, running "scap pull" fixes things [11:25:42] <_joe_> hashar: can you do a full scap sync? 
[11:25:44] but /srv/mediawiki/docroot/wwwportal/portal points to broken ../../portals/prod [11:25:46] <_joe_> that should fix things [11:25:48] yeah [11:25:53] jan_drewniak: how have you synced? [11:25:58] <_joe_> brb [11:27:11] hashar: https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/portals/deploy/+/refs/heads/master/sync-portals [11:27:24] RECOVERY - Disk space on kubernetes2003 is OK: DISK OK [11:27:32] (03PS2) 10Filippo Giunchedi: Use codfw puppetmasters in codfw [dns] - 10https://gerrit.wikimedia.org/r/420005 [11:27:39] !log hashar@tin Synchronized docroot/wwwportal/portal: (no justification provided) (duration: 00m 57s) [11:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:51] should be good now [11:27:58] <_joe_> not in varnish I guess [11:28:02] <_joe_> let me check internally [11:28:05] I did a scap sync-file docroot/wwwportal/portal [11:28:19] jan_drewniak: https://www.wikipedia.org/?burst works for me now [11:28:20] hashar: magic! [11:28:24] so the symlink hasn't been synced [11:28:29] <_joe_> confirmed ok on the servers [11:28:54] jan_drewniak: yeah the sync-portals script does not sync it :D [11:28:57] <_joe_> and also on the caches [11:28:59] PROBLEM - Disk space on kubernetes2001 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/5e150d90-2b68-11e8-9d96-aa000081eedf/volumes/kubernetes.iosecret/tiller-token-8f64j is not accessible: Permission denied [11:29:14] me ^. 
Known, ignore please [11:29:16] it just does the files under portals/ but misses docroot/wwwportal/portal [11:29:25] (03CR) 10Filippo Giunchedi: [C: 032] Use codfw puppetmasters in codfw [dns] - 10https://gerrit.wikimedia.org/r/420005 (owner: 10Filippo Giunchedi) [11:29:33] hashar: omg you're right, I should add that [11:29:42] !log point codfw puppet to puppetmaster2001 [11:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:51] _joe_: thanks for the assistance :] [11:30:35] <_joe_> yw [11:30:46] hashar _joe_ thank you both! [11:31:05] jan_drewniak: and congratulations for the move toward a submodule \o/ [11:34:55] hashar: yeah, that task can finally be closed! :+1: [11:35:20] (03PS1) 10Filippo Giunchedi: Revert "cache: depool puppetmaster2001 from config-master.w.o" [puppet] - 10https://gerrit.wikimedia.org/r/420309 (https://phabricator.wikimedia.org/T184562) [11:36:09] (03CR) 10Filippo Giunchedi: [C: 032] Revert "cache: depool puppetmaster2001 from config-master.w.o" [puppet] - 10https://gerrit.wikimedia.org/r/420309 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [11:41:04] (03CR) 10ArielGlenn: [C: 031] "I wonder what happens when one host (lab100x) is down e.g. for maintenance: repeated puppet failures on the stat host? 
Other than that con" [puppet] - 10https://gerrit.wikimedia.org/r/420083 (https://phabricator.wikimedia.org/T188644) (owner: 10Madhuvishy) [11:41:25] (03PS4) 10Vgutierrez: Make log look tidier on pybal start-up [debs/pybal] - 10https://gerrit.wikimedia.org/r/420299 (https://phabricator.wikimedia.org/T189290) [11:42:10] (03CR) 10jerkins-bot: [V: 04-1] Make log look tidier on pybal start-up [debs/pybal] - 10https://gerrit.wikimedia.org/r/420299 (https://phabricator.wikimedia.org/T189290) (owner: 10Vgutierrez) [11:43:38] (03PS5) 10Vgutierrez: Make log look tidier on pybal start-up [debs/pybal] - 10https://gerrit.wikimedia.org/r/420299 (https://phabricator.wikimedia.org/T189290) [11:44:04] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4060764 (10fgiunchedi) puppetmaster2001 was reimaged with stretch and traffic moved back as planned, notes from the process: # The procedure sho... [11:44:22] (03CR) 10Vgutierrez: Make log look tidier on pybal start-up (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/420299 (https://phabricator.wikimedia.org/T189290) (owner: 10Vgutierrez) [11:44:30] !log reimage mw1293 as video scaler [11:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:18] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10507/ as expected, this just removes the classes from terbium/wasat and the snapshot hosts" [puppet] - 10https://gerrit.wikimedia.org/r/420297 (owner: 10Giuseppe Lavagetto) [11:46:26] (03PS2) 10Giuseppe Lavagetto: hhvm::admin: fix location of references [puppet] - 10https://gerrit.wikimedia.org/r/420297 [11:48:40] https://www.irccloud.com/pastebin/PBSUQx4c/ [11:49:04] kart_: _joe_ ^^^ any idea how to solve that? 
it seems a file is missing [11:49:59] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:51:30] <_joe_> arturo: have you checked if uwsgi was upgraded during the weekend? if so, that could be the cause [11:51:48] _joe_: it wasn't [11:54:12] <_joe_> ok then this is a problem for the research team to solve [11:55:18] <_joe_> !log stopping hhvm on terbium for a test. [11:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:33] who is bmansurov? [11:56:30] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:59:22] _joe_: the service started failing 2 or 3 minutes after bmansurov ssh'd to the server [12:01:56] <_joe_> arturo: ok, so let's assume a bad deploy happened and let's wait for research to fix that [12:02:23] <_joe_> that thing is also using /usr/local/bin/uwsgi, according to your paste earlier [12:02:32] <_joe_> go figure what was done on that VM [12:03:00] yeah [12:03:12] 10Operations, 10Datasets-General-or-Unknown, 10User-ArielGlenn: Reboots of dumps/snapshot hosts - https://phabricator.wikimedia.org/T188242#4060788 (10ArielGlenn) Mediawiki flow history dumps were interrupted and are re-running, so while they will finish up right before the new run, the wikidata weeklies hav... [12:04:17] what's wrong with phabricator? [12:05:16] Error [12:05:16] Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes. [12:05:17] See the error message at the bottom of this page for more information. 
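Editorial note on the portal outage earlier in this log (11:16 to 11:28): the diagnosis was that docroot/wwwportal/portal on the app servers still pointed at ../../portals/prod, which no longer existed, while the staging copy correctly pointed at ../../portals. A minimal sketch of that kind of check follows, assuming nothing about scap itself. The helper names (`describe_link`, `compare_link`, `build_demo`) and the scratch directory layout are hypothetical and only mimic the paths quoted in the log.

```python
import os
import tempfile


def describe_link(path):
    """Return (target, resolves) for a symlink, or None if path is not a symlink."""
    if not os.path.islink(path):
        return None
    target = os.readlink(path)
    # os.path.exists follows symlinks, so it is False for a dangling link.
    return target, os.path.exists(path)


def compare_link(staging_root, deployed_root, relpath):
    """Describe the same symlink in a staging tree and in a deployed tree."""
    return {
        "staging": describe_link(os.path.join(staging_root, relpath)),
        "deployed": describe_link(os.path.join(deployed_root, relpath)),
    }


def build_demo(root):
    """Recreate the situation from the log in a scratch directory (hypothetical layout)."""
    staging = os.path.join(root, "mediawiki-staging")
    deployed = os.path.join(root, "mediawiki")
    for tree in (staging, deployed):
        os.makedirs(os.path.join(tree, "docroot", "wwwportal"))
        os.makedirs(os.path.join(tree, "portals"))  # note: no portals/prod directory
    # Staging has the new symlink target; the deployed tree still has the old one.
    os.symlink("../../portals",
               os.path.join(staging, "docroot", "wwwportal", "portal"))
    os.symlink("../../portals/prod",
               os.path.join(deployed, "docroot", "wwwportal", "portal"))
    return staging, deployed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        staging, deployed = build_demo(tmp)
        report = compare_link(staging, deployed, "docroot/wwwportal/portal")
        for tree, info in report.items():
            target, resolves = info
            state = "ok" if resolves else "DANGLING"
            print(f"{tree}: portal -> {target} [{state}]")
```

A check like this only reports the mismatch; in the log the actual fix was syncing the symlink itself with scap sync-file docroot/wwwportal/portal.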
[12:06:58] (it was a transient error, it works now) [12:08:53] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:12:13] (03PS1) 10Marostegui: db-eqiad.php: Restore original weight for es1016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420312 [12:13:53] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:14:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore original weight for es1016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420312 (owner: 10Marostegui) [12:15:29] (03Merged) 10jenkins-bot: db-eqiad.php: Restore original weight for es1016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420312 (owner: 10Marostegui) [12:16:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore es1016 original weight after kernel, mariadb and socket location upgrade (duration: 00m 58s) [12:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:13] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:17:14] 10Operations, 10DBA, 10Patch-For-Review: Switchover m1 master from db1016 to db1063 - https://phabricator.wikimedia.org/T189655#4060833 (10Marostegui) @akosiaris we are planning to suggest in the meeting today: tomorrow Tuesday at 16:00UTC, would that work for you? [12:17:53] 10Operations, 10DBA, 10Patch-For-Review: Switchover m1 master from db1016 to db1063 - https://phabricator.wikimedia.org/T189655#4060834 (10akosiaris) Yes, that's fine. [12:19:03] 10Operations, 10DBA, 10Patch-For-Review: Switchover m1 master from db1016 to db1063 - https://phabricator.wikimedia.org/T189655#4060839 (10Marostegui) Awesome! Thanks! 
We will mention it on the meeting today then, and we'll see what we get :) [12:19:12] (03CR) 10jenkins-bot: db-eqiad.php: Restore original weight for es1016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420312 (owner: 10Marostegui) [12:31:31] PROBLEM - Check size of conntrack table on mw1293 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:31:31] PROBLEM - configured eth on mw1293 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:32:21] PROBLEM - Check systemd state on labtestmetal2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:33:11] PROBLEM - Check systemd state on mw1293 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:33:11] PROBLEM - dhclient process on mw1293 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:34:18] ^ race in reimage, all fine [12:37:22] !log installing curl security updates [12:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:24] !log T189722 reboot labtestcontrol2001 [12:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:01] PROBLEM - Host labtestcontrol2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:42:23] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420314 [12:42:31] ouch, it boot without grub prompt [12:42:50] RECOVERY - Host labtestcontrol2001 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [12:47:27] (03PS1) 10Sbisson: [DO NOT MERGE] Configure maps source for localized labels [puppet] - 10https://gerrit.wikimedia.org/r/420315 (https://phabricator.wikimedia.org/T112948) [12:50:15] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420314 (owner: 10Marostegui) [12:50:30] RECOVERY - dhclient process on mw1293 is OK: PROCS OK: 0 processes with command name dhclient [12:50:50] 
RECOVERY - Check size of conntrack table on mw1293 is OK: OK: nf_conntrack is 0 % full [12:50:50] RECOVERY - configured eth on mw1293 is OK: OK - interfaces up [12:51:37] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420314 (owner: 10Marostegui) [12:51:52] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420314 (owner: 10Marostegui) [12:52:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1091 (duration: 00m 57s) [12:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:30] RECOVERY - Check systemd state on mw1293 is OK: OK - running: The system is fully operational [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180319T1300). [13:00:04] Ahmed123 and Hauskatze: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:28] I can SWAT today [13:02:09] !log labtestcontrol2001: set GRUB_TIMEOUT=30 in /etc/default/grub, the previous value (10) wasn't enough to display the menu via mgmt [13:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:37] (03PS1) 10Jcrespo: dbproxy: Update m1 proxies to point to db1063 as the primary host [puppet] - 10https://gerrit.wikimedia.org/r/420317 (https://phabricator.wikimedia.org/T189655) [13:04:09] Ahmed123 and Hauskatze: around for swat? 
[13:04:18] (03PS5) 10Giuseppe Lavagetto: hhvm: remove legacy diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/415828 [13:04:19] Yes i'm ready [13:04:51] (03PS1) 10Jcrespo: mariadb: Switchover m1 master from db1016 to db1063 [puppet] - 10https://gerrit.wikimedia.org/r/420318 (https://phabricator.wikimedia.org/T189655) [13:05:01] (03CR) 10Giuseppe Lavagetto: [C: 032] "I re-created both dashboards I found with prometheus-backed data. we Can safely merge this change now." [puppet] - 10https://gerrit.wikimedia.org/r/415828 (owner: 10Giuseppe Lavagetto) [13:05:58] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419535 (https://phabricator.wikimedia.org/T189732) (owner: 10Ahmed123) [13:06:17] (03CR) 10Marostegui: [C: 031] mariadb: Switchover m1 master from db1016 to db1063 [puppet] - 10https://gerrit.wikimedia.org/r/420318 (https://phabricator.wikimedia.org/T189655) (owner: 10Jcrespo) [13:06:57] Ahmed123: your patch will be at mwdebug1002 in a few minutes, do you know how to test there? 
[13:07:14] (03Merged) 10jenkins-bot: Enable rollbacker user right at arwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419535 (https://phabricator.wikimedia.org/T189732) (owner: 10Ahmed123) [13:07:49] yes i know [13:08:34] Ahmed123: 419535 is at mwdebug1002, please test and let me know if i can deploy it [13:09:04] i test the patch, everything is ok [13:09:14] you can merge it [13:09:16] (03CR) 10jenkins-bot: Enable rollbacker user right at arwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419535 (https://phabricator.wikimedia.org/T189732) (owner: 10Ahmed123) [13:09:33] !log reimage mw1294-1296 as video scalers [13:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:48] Ahmed123: ok, deploying [13:09:55] Thanks [13:10:55] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:419535|Enable rollbacker user right at arwikiquote (T189732)]] (duration: 00m 57s) [13:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:00] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/diamond/collectors/HhvmApc/HhvmApc.py] [13:11:01] T189732: Creation of Rollbacker group on ar.wikiquote - https://phabricator.wikimedia.org/T189732 [13:11:10] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/share/diamond/collectors/HhvmApc/HhvmApc.py] [13:11:15] Ahmed123: it's deployed, please test [13:11:23] <_joe_> ah the usual damn race condition [13:11:36] (03PS8) 10Rduran: Add port of osc_host.sh [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 [13:11:37] <_joe_> those puppet alerts are false, discard [13:11:38] (03PS3) 10Rduran: Add flake8 config and requirement [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/420015 [13:11:39] yes, it works. Thank you [13:12:47] 10Operations, 10Packaging: Build .deb package of python3-typing for jessie - https://phabricator.wikimedia.org/T189729#4060962 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:12:53] 10Operations, 10Packaging: Build .deb package of python3-aiokafka - https://phabricator.wikimedia.org/T189741#4060963 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:12:54] zeljkof: Please let me know when you're done with the rest of SWAT, I have a patch to backport. [13:13:03] anomie: sure [13:14:26] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418700 (https://phabricator.wikimedia.org/T148603) (owner: 10Ahmed123) [13:14:31] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/share/diamond/collectors/HhvmApc/HhvmApc.py] [13:15:42] (03Merged) 10jenkins-bot: Revert "Restrict FlaggedRevs to only operated on NS_MAIN on arwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418700 (https://phabricator.wikimedia.org/T148603) (owner: 10Ahmed123) [13:16:54] Ahmed123: 418700 is at mwdebug1002, please test and let me know if i can deploy it [13:18:45] yes, it works good [13:19:01] 10Operations, 10Packaging: Build .deb package of python3-typing for jessie - https://phabricator.wikimedia.org/T189729#4060984 (10MoritzMuehlenhoff) jessie-backports ships a backport of the version in stretch (3.5.2.2), would that version work for you? It installed just fine on a test host for me (and we can i... [13:19:10] Ahmed123: ok, deploying [13:19:30] (03CR) 10jenkins-bot: Revert "Restrict FlaggedRevs to only operated on NS_MAIN on arwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418700 (https://phabricator.wikimedia.org/T148603) (owner: 10Ahmed123) [13:19:40] Hauskatze: your patch will not be deployed if you are not around [13:20:11] !log zfilipin@tin Synchronized wmf-config/flaggedrevs.php: SWAT: [[gerrit:418700|Revert "Restrict FlaggedRevs to only operated on NS_MAIN on arwiki" (T148603 T189224)]] (duration: 00m 58s) [13:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:17] T148603: Limit the Quality version of the flagged revision in Arabic Wikipedia to ns=0 - https://phabricator.wikimedia.org/T148603 [13:20:17] T189224: FlaggedRevs bar for a template shown in all articles though there are no template versions to review - https://phabricator.wikimedia.org/T189224 [13:20:22] Ahmed123: it's deployed, please test [13:21:28] it works fine, thank you [13:22:19] Ahmed123: great, thanks for deploying with #wikimedia-releng ;) [13:22:53] !log EU SWAT finished [13:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:10] anomie: eu 
swat is finished [13:23:14] Thanks [13:35:29] (03PS2) 10Jcrespo: dbproxy: Update m1 proxies to point to db1063 as the primary host [puppet] - 10https://gerrit.wikimedia.org/r/420317 (https://phabricator.wikimedia.org/T189655) [13:35:31] (03PS2) 10Jcrespo: mariadb: Switchover m1 master from db1016 to db1063 [puppet] - 10https://gerrit.wikimedia.org/r/420318 (https://phabricator.wikimedia.org/T189655) [13:35:33] (03PS1) 10Jcrespo: mariadb: Move default socket for misc services to /run [puppet] - 10https://gerrit.wikimedia.org/r/420331 (https://phabricator.wikimedia.org/T148507) [13:35:53] (03PS2) 10Jcrespo: mariadb: Move default socket for misc services to /run [puppet] - 10https://gerrit.wikimedia.org/r/420331 (https://phabricator.wikimedia.org/T148507) [13:36:51] 10Operations, 10Packaging: Build .deb package of python3-aiokafka - https://phabricator.wikimedia.org/T189741#4061010 (10MoritzMuehlenhoff) I have created https://gerrit.wikimedia.org/r/operations/debs/python-aiokafka, could you import your package there for review, please? 
[13:37:36] PROBLEM - Disk space on kubernetes2004 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/610146e8-2b7a-11e8-9d96-aa000081eedf/volumes/kubernetes.iosecret/default-token-j85g3 is not accessible: Permission denied [13:40:56] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:41:07] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:41:37] RECOVERY - Disk space on kubernetes2004 is OK: DISK OK [13:44:16] PROBLEM - Disk space on kubernetes2002 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/42e9a3f6-2b7b-11e8-9d96-aa000081eedf/volumes/kubernetes.iosecret/default-token-j85g3 is not accessible: Permission denied [13:44:36] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:44:43] !log anomie@tin Synchronized php-1.31.0-wmf.25/includes/filerepo/file/LocalFile.php: Applying fix for T189985 (duration: 00m 58s) [13:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:50] T189985: image_comment_temp entries aren't being moved when a file is renamed - https://phabricator.wikimedia.org/T189985 [13:46:49] (03PS1) 10Andrew Bogott: get_images: chase down one more image-sync corner case [wikitech-static] - 10https://gerrit.wikimedia.org/r/420332 [13:47:01] (03CR) 10Andrew Bogott: [V: 032 C: 032] get_images: chase down one more image-sync corner case [wikitech-static] - 10https://gerrit.wikimedia.org/r/420332 (owner: 10Andrew Bogott) [13:48:07] !log Cleaning up orphaned image_comment_temp rows on all wikis for T189985 [13:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:54] (03PS1) 10Muehlenhoff: Add Cumin aliases for ores [puppet] - 10https://gerrit.wikimedia.org/r/420334 [13:54:16] RECOVERY - Disk space on kubernetes2002 is OK: DISK OK [13:54:56] RECOVERY - Check systemd state on 
labtestmetal2001 is OK: OK - running: The system is fully operational [13:54:57] 10Operations, 10Cassandra, 10Services (doing), 10User-Eevans, 10User-Elukey: Test/upload new cassandra 2.2.6 package (wmf3) - https://phabricator.wikimedia.org/T189529#4061093 (10Eevans) >>! In T189529#4060241, @elukey wrote: > Tried to (manually via dpkg -i) install cassandra 2.2.6-wmf3 on aqs1004: > >... [13:56:20] (03PS2) 10Elukey: role::analytics_cluster::client: move config to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/420301 (https://phabricator.wikimedia.org/T167790) [13:56:22] (03PS1) 10Elukey: profile::analytics::cluster::client: add check for /mnt/hdfs [puppet] - 10https://gerrit.wikimedia.org/r/420335 (https://phabricator.wikimedia.org/T187073) [13:57:56] PROBLEM - Check systemd state on labtestmetal2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:58:42] PROBLEM - Disk space on kubernetes2004 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/56e88c0c-2b7d-11e8-9d96-aa000081eedf/volumes/kubernetes.iosecret/default-token-j85g3 is not accessible: Permission denied [14:02:01] PROBLEM - Disk space on kubernetes2003 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/b5a4e8a4-2b7d-11e8-9d96-aa000081eedf/volumes/kubernetes.iosecret/default-token-npnbh is not accessible: Permission denied [14:02:21] PROBLEM - Check whether ferm is active by checking the default input chain on mw1294 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:21] PROBLEM - nutcracker port on mw1294 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:21] PROBLEM - Check size of conntrack table on mw1296 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:21] PROBLEM - configured eth on mw1296 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:04:01] PROBLEM - DPKG on mw1294 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[14:04:02] PROBLEM - nutcracker process on mw1294 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:04:02] PROBLEM - Check systemd state on mw1296 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:04:02] PROBLEM - dhclient process on mw1296 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:04:38] * elukey checks nitrogen [14:05:01] RECOVERY - Disk space on kubernetes2003 is OK: DISK OK [14:05:02] ah no [14:05:10] I misread the alerts :) [14:05:31] those are the new mw videoscalers [14:05:32] (03PS1) 10Lucas Werkmeister (WMDE): Disable reading wb_terms search fields on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420336 (https://phabricator.wikimedia.org/T189776) [14:05:42] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1296 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:05:42] PROBLEM - mediawiki-installation DSH group on mw1296 is CRITICAL: Host mw1296 is not in mediawiki-installation dsh group [14:06:41] PROBLEM - HHVM jobrunner on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:08:19] yeah, silencing [14:09:53] !log restarting apache on contint1001 to pick up curl security update [14:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:42] (03CR) 10Ottomata: profile::analytics::cluster::client: add check for /mnt/hdfs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/420335 (https://phabricator.wikimedia.org/T187073) (owner: 10Elukey) [14:10:50] 10Operations, 10Puppet, 10Release-Engineering-Team: puppetdb4: use postgres db backend in puppet-compiler - https://phabricator.wikimedia.org/T187258#4061133 (10herron) Ready to begin upgrading the puppet compiler now (in preparation for the prod puppetdb upgrade). Here's the process I have in mind: # disa... 
[14:11:48] (03CR) 10Ottomata: [C: 031] role::analytics_cluster::client: move config to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/420301 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:13:19] (03PS2) 10Elukey: profile::analytics::cluster::client: add check for /mnt/hdfs [puppet] - 10https://gerrit.wikimedia.org/r/420335 (https://phabricator.wikimedia.org/T187073) [14:13:52] (03CR) 10Elukey: "Forgot to put WIP! Wanted to have a chat with you first, buuuut we can do it in here :)" [puppet] - 10https://gerrit.wikimedia.org/r/420335 (https://phabricator.wikimedia.org/T187073) (owner: 10Elukey) [14:16:11] RECOVERY - dhclient process on mw1296 is OK: PROCS OK: 0 processes with command name dhclient [14:16:22] RECOVERY - Check size of conntrack table on mw1296 is OK: OK: nf_conntrack is 0 % full [14:16:22] RECOVERY - configured eth on mw1296 is OK: OK - interfaces up [14:17:12] RECOVERY - DPKG on mw1294 is OK: All packages OK [14:17:16] (03PS1) 10Muehlenhoff: Add four new video scalers [puppet] - 10https://gerrit.wikimedia.org/r/420338 [14:17:22] RECOVERY - Check whether ferm is active by checking the default input chain on mw1294 is OK: OK ferm input default policy is set [14:17:30] (03CR) 10Elukey: profile::analytics::cluster::client: add check for /mnt/hdfs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/420335 (https://phabricator.wikimedia.org/T187073) (owner: 10Elukey) [14:17:31] RECOVERY - HHVM jobrunner on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [14:17:50] (03CR) 10Elukey: [C: 032] role::analytics_cluster::client: move config to a separate profile [puppet] - 10https://gerrit.wikimedia.org/r/420301 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:19:39] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: CRITICAL - kubelet_operational_latencies is 43533 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:19:50] (03PS2) 
10Alexandros Kosiaris: Update mathoid chart to resemble current production [deployment-charts] - 10https://gerrit.wikimedia.org/r/420305 [14:19:52] (03PS1) 10Alexandros Kosiaris: Fix wrongly indented externalIPs field [deployment-charts] - 10https://gerrit.wikimedia.org/r/420341 [14:20:39] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: OK - kubelet_operational_latencies is 1924 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:21:44] (03PS1) 10Filippo Giunchedi: Stop using NameVirtualHost [puppet] - 10https://gerrit.wikimedia.org/r/420342 [14:24:01] (03CR) 10Elukey: profile::analytics::cluster::client: add check for /mnt/hdfs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/420335 (https://phabricator.wikimedia.org/T187073) (owner: 10Elukey) [14:24:58] (03PS16) 10Paladox: Phabricator: Support php 7.2 under stretch [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [14:25:19] RECOVERY - nutcracker process on mw1294 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [14:25:30] RECOVERY - nutcracker port on mw1294 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [14:28:00] PROBLEM - Disk space on kubernetes2002 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/65cf8aa1-2b81-11e8-9d96-aa000081eedf/volumes/kubernetes.iosecret/default-token-npnbh is not accessible: Permission denied [14:28:21] PROBLEM - Disk space on kubernetes2003 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/6623314a-2b81-11e8-9d96-aa000081eedf/volumes/kubernetes.iosecret/default-token-npnbh is not accessible: Permission denied [14:28:26] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: kubernetes1001.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid']) [14:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:43] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; 
selector: kubernetes2001.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mathoid']) [14:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:55] mobrovac: FYI ^ [14:32:22] hi, quick question, why do we run apt-get update on /bin (not apt-get update) ? [14:32:24] https://github.com/wikimedia/puppet/blame/7e79228cd060b88fd83f2c731bad73f89fde9477/modules/apt/manifests/init.pp#L5 [14:32:49] that doesn't work, at least not with the phabricator class [14:32:56] ? [14:33:04] I can't parse the question [14:33:40] (03PS1) 10Andrew Bogott: reload apache after image sync. [wikitech-static] - 10https://gerrit.wikimedia.org/r/420349 [14:33:45] what does apt-get update have to do with /bin ? [14:33:51] (03CR) 10Andrew Bogott: [V: 032 C: 032] reload apache after image sync. [wikitech-static] - 10https://gerrit.wikimedia.org/r/420349 (owner: 10Andrew Bogott) [14:34:07] (03CR) 10Gilles: webperf: Always record country specific when oversampling (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [14:34:09] akosiaris according to https://github.com/wikimedia/puppet/blame/7e79228cd060b88fd83f2c731bad73f89fde9477/modules/apt/manifests/init.pp#L5 that is how it is running it [14:34:18] which doesn't work as /bin is not apt-get update [14:34:45] !log reboot kafka1002 (eventbus/jobqueue) for kernel upgrades [14:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:50] it does say /usr/bin there so I am not sure what the problem is [14:35:16] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4061281 (10fgiunchedi) https://gerrit.wikimedia.org/r/c/420342/ for the namevirtualhost deprecation [14:35:16] akosiaris it's not running apt-get update though [14:35:24] as I am trying it out with the phabricator class [14:35:42] RECOVERY
- Check the NTP synchronisation status of timesyncd on mw1296 is OK: OK: synced at Mon 2018-03-19 14:35:40 UTC. [14:35:51] and I am finding that it's not installing php7.2 even after apt::repository runs that exec [14:36:02] it has a refreshonly => true [14:36:25] it only runs if another resource informs that resource that it needs to run [14:36:31] yep [14:36:31] RECOVERY - Check systemd state on mw1296 is OK: OK - running: The system is fully operational [14:36:36] and of course this has nothing to do with /bin or /usr/bin [14:36:40] akosiaris like https://github.com/wikimedia/puppet/blob/production/modules/apt/manifests/repository.pp#L29 ? [14:36:56] akosiaris I mean that the path => '/usr/bin', is wrong [14:37:00] why ? [14:37:12] it should be command => '/usr/bin/apt-get update', [14:37:37] or it can be command => 'apt-get update', path => '/usr/bin' [14:37:54] and command is not even required since the name will be used if omitted [14:38:09] hmm, let me test it with both command and path [14:38:33] - **command** (*namevar*) [14:38:33] The actual command to execute.
Must either be fully qualified [14:38:33] or a search path for the command must be provided [14:38:39] :-) [14:40:06] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: kubernetes2002.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mathoid']) [14:40:10] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: kubernetes2003.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mathoid']) [14:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:13] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: kubernetes2004.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mathoid']) [14:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:32] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: kubernetes1002.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid']) [14:40:34] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: kubernetes1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid']) [14:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:36] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: kubernetes1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid']) [14:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:14] (03CR) 10Elukey: [C: 031] Stop using NameVirtualHost [puppet] - 10https://gerrit.wikimedia.org/r/420342 (owner: 10Filippo Giunchedi) [14:42:11] (03CR) 10Giuseppe Lavagetto: [C: 031] Stop using NameVirtualHost [puppet] - 10https://gerrit.wikimedia.org/r/420342 (owner: 10Filippo Giunchedi) [14:42:55] !log T184919 pool all kubernetes for service 
mathoid. [14:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:02] T184919: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919 [14:45:55] (03PS2) 10Filippo Giunchedi: Stop using NameVirtualHost [puppet] - 10https://gerrit.wikimedia.org/r/420342 [14:45:57] (03PS1) 10Filippo Giunchedi: puppetmaster: disable puppet-master service [puppet] - 10https://gerrit.wikimedia.org/r/420351 (https://phabricator.wikimedia.org/T184562) [14:46:34] (03CR) 10Filippo Giunchedi: [C: 032] Stop using NameVirtualHost [puppet] - 10https://gerrit.wikimedia.org/r/420342 (owner: 10Filippo Giunchedi) [14:46:46] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: disable puppet-master service [puppet] - 10https://gerrit.wikimedia.org/r/420351 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [14:48:26] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2597 bytes in 0.319 second response time [14:48:40] wow [14:48:43] <_joe_> it uh? [14:48:43] ugh [14:48:46] (03PS1) 10Alexandros Kosiaris: Ignore /var/lib/kubelet in disk_space checks [puppet] - 10https://gerrit.wikimedia.org/r/420353 [14:48:47] wut? [14:49:02] <_joe_> what's up? [14:49:08] not sure yet [14:49:12] text-lb is probably me [14:49:16] ah [14:49:47] testing? [14:50:03] <_joe_> a ton of 503s there AFAICT [14:50:20] false positive or real outage? [14:51:06] <_joe_> looks like a real outage? 
300 5xx per second in ulsfo [14:51:26] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17449 bytes in 0.501 second response time [14:51:30] yes, real, but not random, see -traffic [14:51:43] (we're testing theories) [14:51:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:51:59] I don't see 503s on cp4027's fe with varnishlog [14:52:22] it's already gone [14:52:25] ema: https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1&from=now-1h&to=now [14:52:31] <_joe_> ema: I looked here https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-aggregate-client-status-code?orgId=1&from=now-30m&to=now&var-site=ulsfo&var-cache_type=varnish-text&var-cache_type=varnish-misc&var-cache_type=varnish-upload&var-status_type=5 [14:52:49] oh ok, so too late to see them [14:52:56] <_joe_> yup [14:52:58] so, over and cause is known? 
[14:53:01] (03PS1) 10Ottomata: Set retries=3 for eventbus kafka producer [puppet] - 10https://gerrit.wikimedia.org/r/420355 (https://phabricator.wikimedia.org/T180017) [14:53:02] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [14:53:39] paravoid: yes, we're actively experimenting, and an experiment had apparently-bad consequences [14:54:08] (03CR) 10Ottomata: [C: 032] Set retries=3 for eventbus kafka producer [puppet] - 10https://gerrit.wikimedia.org/r/420355 (https://phabricator.wikimedia.org/T180017) (owner: 10Ottomata) [14:54:45] (03CR) 10Ppchelko: [C: 031] Set retries=3 for eventbus kafka producer [puppet] - 10https://gerrit.wikimedia.org/r/420355 (https://phabricator.wikimedia.org/T180017) (owner: 10Ottomata) [14:57:46] (03PS17) 10Paladox: Phabricator: Support php 7.2 under stretch [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [14:59:57] (03PS1) 10Ottomata: Include proper eventloggingctl script for jessie [puppet] - 10https://gerrit.wikimedia.org/r/420357 [15:03:17] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [15:03:51] (03CR) 10Ottomata: [C: 032] Include proper eventloggingctl script for jessie [puppet] - 10https://gerrit.wikimedia.org/r/420357 (owner: 10Ottomata) [15:03:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:04:43] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420358 [15:05:51] !log upgrading java on contint1001 / contint2001 [15:05:53] akosiaris i've managed to work around the issue with https://github.com/wikimedia/puppet/blob/36578fc2ae29ecca4497e8925f932570b152306f/modules/openstack/manifests/cloudrepo.pp#L22 now. [15:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:33] 10Operations, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4061400 (10Marostegui) >>! In T189216#4058198, @Marostegui wrote: > So, the checks finished and there were differences on testreduce_0715.results (173GB) table, between the followin... [15:07:37] (03PS2) 10Muehlenhoff: Add four new video scalers [puppet] - 10https://gerrit.wikimedia.org/r/420338 [15:08:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420358 (owner: 10Marostegui) [15:09:46] (03CR) 10Muehlenhoff: [C: 032] Add four new video scalers [puppet] - 10https://gerrit.wikimedia.org/r/420338 (owner: 10Muehlenhoff) [15:10:14] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420358 (owner: 10Marostegui) [15:10:29] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420358 (owner: 10Marostegui) [15:11:12] paladox: I haven't even understood what issue you had but cool if you sidestepped it [15:11:24] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Someday): Add support for stretch in the phabricator puppet class - 
https://phabricator.wikimedia.org/T187127#4061412 (10Paladox) With https://gerrit.wikimedia.org/r/#/c/410245/ this fixes stretch support, we need to re arange loa... [15:11:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1091 (duration: 01m 01s) [15:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:59] (03CR) 10Marostegui: [C: 031] dbproxy: Update m1 proxies to point to db1063 as the primary host [puppet] - 10https://gerrit.wikimedia.org/r/420317 (https://phabricator.wikimedia.org/T189655) (owner: 10Jcrespo) [15:13:13] 10Operations, 10ops-eqiad: WMF4727 hardware issue - disks dont detect in installer - https://phabricator.wikimedia.org/T189804#4061414 (10Cmjohnson) Replaced both disks @robh please close this task once confirmed issues has been resolved Return shipping for both disk in one box 9202 3946 5301 2438 2758 56 9... [15:16:46] 10Operations, 10ops-eqiad: WMF4727 hardware issue - disks dont detect in installer - https://phabricator.wikimedia.org/T189804#4061437 (10Cmjohnson) Replaced the second disk @robh please close this task once confirmed issues has been resolved Return Shipping 9302 3946 5301 2438 2536 63 9611918 2393026 75003684 [15:23:51] !log reboot kafka1003 for kernel upgrades (jobqueues/eventbus) [15:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:44] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4061484 (10Cmjohnson) @ayounsi asw-b fpc1 - fpce8 should be up now [15:26:50] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4061486 (10ayounsi) Confirmed, thanks! 
[15:29:22] (03PS1) 10Marostegui: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420364 [15:36:43] 10Operations, 10monitoring, 10Patch-For-Review: restbase: skip icinga monitoring if on "dev" machines - https://phabricator.wikimedia.org/T189050#4061546 (10Dzahn) partially resolved because i still would like this one merged: https://gerrit.wikimedia.org/r/#/c/419084/ which allows us to skip systemd monit... [15:39:31] (03PS18) 10Paladox: Phabricator: Support php 7.2 under stretch [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [15:42:14] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: re-create script for manual paging - https://phabricator.wikimedia.org/T82937#4061555 (10Dzahn) the script exists on the Icinga server and i was about to send a mail about it to the ops list, then i realized it needs a feature to "mail all but just... [15:42:33] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1091 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420364 (owner: 10Marostegui) [15:42:44] (03CR) 10Bstorm: "> Can you check if re.match does a "contains" and not a "full-string" [puppet] - 10https://gerrit.wikimedia.org/r/420114 (https://phabricator.wikimedia.org/T188680) (owner: 10Bstorm) [15:43:54] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420364 (owner: 10Marostegui) [15:48:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original weight for db1091 (duration: 00m 59s) [15:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:08] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1091 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420364 (owner: 10Marostegui) [15:50:00] 10Operations, 10monitoring: Netbox: add Icinga check for PosgreSQL - https://phabricator.wikimedia.org/T185504#4061585 
(10Dzahn) a:03Dzahn [15:51:43] (03PS2) 10Alexandros Kosiaris: Ignore /var/lib/kubelet in disk_space checks [puppet] - 10https://gerrit.wikimedia.org/r/420353 [15:53:11] 10Operations, 10Traffic, 10Patch-For-Review: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4061600 (10ema) Further investigation today showed that the cause of this is VCL staying in the `auto/busy` state. All those VCLs' probes keep on running. At a ce... [15:59:38] 10Operations, 10Patch-For-Review: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#4061630 (10Volans) [15:59:42] 10Operations, 10monitoring: Netbox: add Icinga check for the website - https://phabricator.wikimedia.org/T185505#4061627 (10Volans) 05Open>03Resolved a:03Volans Agreed on the meeting that for now the simple HTTP check is enough, given that we check that the uWSGI web app is running too. [16:00:26] 10Operations, 10Patch-For-Review: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#3890758 (10Volans) [16:03:32] (03CR) 10Alexandros Kosiaris: [C: 032] Ignore /var/lib/kubelet in disk_space checks [puppet] - 10https://gerrit.wikimedia.org/r/420353 (owner: 10Alexandros Kosiaris) [16:05:47] RECOVERY - mediawiki-installation DSH group on mw1296 is OK: OK [16:08:17] RECOVERY - Disk space on kubernetes1001 is OK: DISK OK [16:09:07] RECOVERY - Disk space on kubernetes1003 is OK: DISK OK [16:12:28] RECOVERY - Disk space on kubernetes1002 is OK: DISK OK [16:14:28] RECOVERY - Disk space on kubernetes2001 is OK: DISK OK [16:15:38] RECOVERY - Disk space on kubernetes2002 is OK: DISK OK [16:18:28] RECOVERY - Disk space on kubernetes2004 is OK: DISK OK [16:18:48] RECOVERY - Disk space on kubernetes2003 is OK: DISK OK [16:19:37] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Someday): Add support for stretch in the phabricator puppet class - 
https://phabricator.wikimedia.org/T187127#4061727 (10Paladox) I've fixed upped https://gerrit.wikimedia.org/r/#/c/410245/ now, and now it works on stretch and je... [16:20:02] (03PS1) 10Papaul: DNS: Add mgmt DNS entries for mw2259-mw2290 [dns] - 10https://gerrit.wikimedia.org/r/420372 [16:26:53] 10Operations, 10Ops-Access-Requests, 10Discovery-Search (Current work): Google Search Console access for Search Platform team - https://phabricator.wikimedia.org/T188453#4061738 (10RobH) a:03mark [16:28:25] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4061741 (10Papaul) [16:31:10] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK [16:35:54] (03CR) 10Paladox: "Ping, this has puppet errors, from irc" [puppet] - 10https://gerrit.wikimedia.org/r/317450 (https://phabricator.wikimedia.org/T133548) (owner: 10Alex Monk) [16:36:08] (03PS5) 10Paladox: POC: Secure redirect service [puppet] - 10https://gerrit.wikimedia.org/r/317450 (https://phabricator.wikimedia.org/T133548) (owner: 10Alex Monk) [16:36:41] (03CR) 10jerkins-bot: [V: 04-1] POC: Secure redirect service [puppet] - 10https://gerrit.wikimedia.org/r/317450 (https://phabricator.wikimedia.org/T133548) (owner: 10Alex Monk) [16:39:35] (03PS6) 10Paladox: POC: Secure redirect service [puppet] - 10https://gerrit.wikimedia.org/r/317450 (https://phabricator.wikimedia.org/T133548) (owner: 10Alex Monk) [16:39:54] (03CR) 10Paladox: "changed <%- nginx_cfg[domain].each do |path, target| -%> to <%- @nginx_cfg[domain].each do |path, target| -%>" [puppet] - 10https://gerrit.wikimedia.org/r/317450 (https://phabricator.wikimedia.org/T133548) (owner: 10Alex Monk) [16:39:59] (03PS3) 10Elukey: profile::analytics::cluster::client: add check for /mnt/hdfs [puppet] - 10https://gerrit.wikimedia.org/r/420335 (https://phabricator.wikimedia.org/T187073) [16:40:06] (03CR) 10jerkins-bot: [V: 04-1] POC: Secure redirect service [puppet] - 
10https://gerrit.wikimedia.org/r/317450 (https://phabricator.wikimedia.org/T133548) (owner: 10Alex Monk) [16:51:18] (03PS4) 10Elukey: profile::analytics::cluster::client: add check for /mnt/hdfs [puppet] - 10https://gerrit.wikimedia.org/r/420335 (https://phabricator.wikimedia.org/T187073) [16:57:50] 10Operations, 10Ops-Access-Requests: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4061821 (10RobH) I've pinged @greg via IRC to try and determine what the preferred method of getting release engineering approval is for these. [17:00:04] gehel: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180319T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:02:53] (03PS3) 10Filippo Giunchedi: Stop using NameVirtualHost [puppet] - 10https://gerrit.wikimedia.org/r/420342 [17:05:42] (03PS3) 10Jcrespo: dbproxy: Update m1 proxies to point to db1063 as the primary host [puppet] - 10https://gerrit.wikimedia.org/r/420317 (https://phabricator.wikimedia.org/T189655) [17:05:44] (03PS3) 10Jcrespo: mariadb: Switchover m1 master from db1016 to db1063 [puppet] - 10https://gerrit.wikimedia.org/r/420318 (https://phabricator.wikimedia.org/T189655) [17:05:46] (03PS3) 10Jcrespo: mariadb: Move default socket for misc services to /run [puppet] - 10https://gerrit.wikimedia.org/r/420331 (https://phabricator.wikimedia.org/T148507) [17:06:01] (03PS4) 10Jcrespo: mariadb: Move default socket for misc services to /run [puppet] - 10https://gerrit.wikimedia.org/r/420331 (https://phabricator.wikimedia.org/T148507) [17:06:23] (03CR) 10Dzahn: "requesttracker will not be removed, actually there is a ticket to upgrade it to stretch :p" [puppet] - 10https://gerrit.wikimedia.org/r/420342 (owner: 10Filippo Giunchedi) [17:08:18] mutante: doh! 
fix incoming [17:08:23] (03PS1) 10Filippo Giunchedi: requesttracker: remove NameVirtualHost [puppet] - 10https://gerrit.wikimedia.org/r/420379 [17:08:27] mutante: ^ [17:09:09] (03CR) 10Dzahn: [C: 032] requesttracker: remove NameVirtualHost [puppet] - 10https://gerrit.wikimedia.org/r/420379 (owner: 10Filippo Giunchedi) [17:09:10] ;) [17:09:33] thanks! [17:10:00] thank you, applying it on RT server [17:12:16] (03PS11) 10Madhuvishy: NFS: add custom script to generate target hosts [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [17:13:06] (03CR) 10Madhuvishy: [C: 032] NFS: add custom script to generate target hosts [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [17:14:50] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler02/10510/" [puppet] - 10https://gerrit.wikimedia.org/r/420331 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [17:15:24] (03CR) 10Jcrespo: "It *may* break heartbeat and could need to be restarted on misc masters." [puppet] - 10https://gerrit.wikimedia.org/r/420331 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [17:23:10] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/nfs-hostlist] [17:24:25] (03PS1) 10Madhuvishy: cumin: Move nfs-hostlist to right path [puppet] - 10https://gerrit.wikimedia.org/r/420381 [17:24:44] ^ i'm fixing the puppet thing [17:25:11] (03CR) 10Madhuvishy: [C: 032] cumin: Move nfs-hostlist to right path [puppet] - 10https://gerrit.wikimedia.org/r/420381 (owner: 10Madhuvishy) [17:25:45] oh, ooops, sorry about that [17:26:50] it's alright! 
I should have caught it [17:27:57] all good now :) [17:28:01] RECOVERY - puppet last run on labpuppetmaster1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:29:00] PROBLEM - puppet last run on labpuppetmaster1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/nfs-hostlist] [17:31:09] ^ recovery should be in in a sec [17:33:51] RECOVERY - puppet last run on labpuppetmaster1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:36:17] (03PS1) 10Elukey: Refactor stat1005's roles into role/profiles [puppet] - 10https://gerrit.wikimedia.org/r/420383 (https://phabricator.wikimedia.org/T167790) [17:37:00] madhuvishy: hey, yt? [17:41:58] (03PS8) 10Imarlier: navtiming.py: Make sure to record country specific when oversampling [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) [17:42:33] (03CR) 10Ottomata: [C: 031] profile::analytics::cluster::client: add check for /mnt/hdfs [puppet] - 10https://gerrit.wikimedia.org/r/420335 (https://phabricator.wikimedia.org/T187073) (owner: 10Elukey) [17:42:49] (03PS9) 10Imarlier: navtiming.py: Make sure to record country specific when oversampling [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) [17:43:00] (03PS1) 10Reedy: Add advisorswiki to $private_wikis [puppet] - 10https://gerrit.wikimedia.org/r/420385 (https://phabricator.wikimedia.org/T189181) [17:46:08] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4061999 (10ayounsi) [17:46:38] 10Operations, 10ops-eqiad, 10netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4062000 (10ayounsi) [17:47:06] 10Operations, 10ops-eqiad, 10netops: Rack/cable/configure asw2-a-eqiad switch stack - 
https://phabricator.wikimedia.org/T187960#4062003 (10ayounsi) [17:47:22] (03PS1) 10Reedy: Add advisorswiki to apache config [puppet] - 10https://gerrit.wikimedia.org/r/420386 (https://phabricator.wikimedia.org/T189181) [17:48:20] paravoid: hello! [17:48:58] oh hi [17:49:11] I think you have some kind of test on stat1006, right? [17:49:22] something about /srv/home/madhuvishy/dumpsmount-test? [17:49:31] I think it's breaking stuff, e.g. df -h doesn't work [17:49:45] (03PS1) 10Reedy: Add advisors.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/420387 (https://phabricator.wikimedia.org/T189181) [17:50:08] paravoid: aah, yes must have broken with my nfs server reboot, fixed [17:50:30] thanks for the ping! [17:50:55] :) [17:51:21] foreign mounts can be dangerous in general (e.g. suid binaries, if you don't mount with nosuid/noexec etc.) [17:52:10] and stat1006 is kind of the danger zone too, because it has sensitive data and has a bunch of non-root users [17:53:10] so I'd recommend against testing in that server specifically, or if you do absolutely need to, to do it for short periods of time, monitor it, use !log for awareness etc. 
[17:53:37] paravoid: uhh yeah, this is just a read only dumps mount that will replace the one from dataset, i should have unmounted after making sure it worked post the vlan firewall changes [17:54:36] (03PS1) 10BBlack: Add more sleep delay on varnish restarts [puppet] - 10https://gerrit.wikimedia.org/r/420388 (https://phabricator.wikimedia.org/T189892) [17:54:38] (03PS1) 10BBlack: bump vcl reload delay by 3.6s [puppet] - 10https://gerrit.wikimedia.org/r/420389 (https://phabricator.wikimedia.org/T189892) [17:54:40] (03PS1) 10BBlack: Increase varnish probe interval to 1s [puppet] - 10https://gerrit.wikimedia.org/r/420390 (https://phabricator.wikimedia.org/T189892) [17:54:54] no worries, no harm done :) [17:57:54] (03PS5) 10Herron: naggen2: add support for puppetdb v4 settings and api [puppet] - 10https://gerrit.wikimedia.org/r/413435 (https://phabricator.wikimedia.org/T188032) [17:58:35] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4062038 (10ayounsi) [17:58:48] 10Operations, 10ops-eqiad, 10netops: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#4062039 (10ayounsi) [17:59:03] 10Operations, 10ops-eqiad, 10netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4062040 (10ayounsi) [17:59:13] (03PS2) 10Elukey: Refactor stat1005's roles into role/profiles [puppet] - 10https://gerrit.wikimedia.org/r/420383 (https://phabricator.wikimedia.org/T167790) [17:59:32] (03CR) 10Herron: [C: 032] naggen2: add support for puppetdb v4 settings and api [puppet] - 10https://gerrit.wikimedia.org/r/413435 (https://phabricator.wikimedia.org/T188032) (owner: 10Herron) [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 8 patches) deploy. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180319T1800). [18:00:04] RoanKattouw and Urbanecm: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:25] here [18:01:00] I'll SWAT [18:01:39] RoanKattouw, ok, please ping me as soon as you will need something from me [18:01:59] (03CR) 10Catrope: [C: 032] Add more import sources to mrwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419996 (https://phabricator.wikimedia.org/T188486) (owner: 10Urbanecm) [18:02:03] (03PS3) 10Catrope: Add more import sources to mrwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419996 (https://phabricator.wikimedia.org/T188486) (owner: 10Urbanecm) [18:02:10] (03CR) 10Catrope: [C: 032] Add more import sources to mrwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419996 (https://phabricator.wikimedia.org/T188486) (owner: 10Urbanecm) [18:02:19] (03CR) 10Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler02/10513/stat1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/420383 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [18:02:46] (03CR) 10Elukey: "Andrew this is a quick draft, let me know if it makes sense or not :)" [puppet] - 10https://gerrit.wikimedia.org/r/420383 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [18:02:56] (03CR) 10RobH: [C: 032] Add advisors.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/420387 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy) [18:03:26] (03PS7) 10Herron: puppet_compiler: add support for puppetdb4 and local postgresql [puppet] - 10https://gerrit.wikimedia.org/r/413881 (https://phabricator.wikimedia.org/T187258) [18:03:35] (03Merged) 10jenkins-bot: Add more import sources to mrwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419996 
(https://phabricator.wikimedia.org/T188486) (owner: 10Urbanecm) [18:03:50] PROBLEM - Request latencies on acrab is CRITICAL: CRITICAL - apiserver_request_latencies is 21059228 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:04:47] Urbanecm: Your import sources patch is on mwdebug1002, please test [18:04:50] RECOVERY - Request latencies on acrab is OK: OK - apiserver_request_latencies is 5457 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:05:04] (03CR) 10Herron: [C: 032] puppet_compiler: add support for puppetdb4 and local postgresql [puppet] - 10https://gerrit.wikimedia.org/r/413881 (https://phabricator.wikimedia.org/T187258) (owner: 10Herron) [18:05:22] (03PS5) 10Urbanecm: Add gor to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/416929 (https://phabricator.wikimedia.org/T189109) [18:06:10] RoanKattouw, well...impossible, I'm not an admin here nor global transwiki importer [18:06:18] If anybody here is, they can do it instead of me [18:06:27] Let me see if I am [18:06:47] Nope I'm not either [18:06:55] OK well it's only an import source, I'll just deploy it [18:08:26] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Add enwiki and commons as import sources to mrwikisource (T188486) (duration: 00m 58s) [18:08:27] (03PS1) 10Vgutierrez: Fix memory leak when discarding labels [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/420395 [18:08:30] (03PS2) 10Catrope: Enable mapframe on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420077 (https://phabricator.wikimedia.org/T189883) (owner: 10Jayprakash12345) [18:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:32] T188486: Let administrators Import from more sources on Marathi wikisource - https://phabricator.wikimedia.org/T188486 [18:09:09] 10Operations, 10Ops-Access-Requests, 10Release-Engineering-Team (Kanban), 10User-greg: Requesting deployment access for samwilson - 
https://phabricator.wikimedia.org/T189414#4062061 (10RobH) a:03greg @greg, I'm assigning this to you for your approval, as release engineering has to sign off on new deploy... [18:09:51] PROBLEM - Request latencies on acrab is CRITICAL: CRITICAL - apiserver_request_latencies is 62889045 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:10:10] PROBLEM - Request latencies on acrux is CRITICAL: CRITICAL - apiserver_request_latencies is 51068673 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:10:33] (03CR) 10Catrope: [C: 032] Enable mapframe on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420077 (https://phabricator.wikimedia.org/T189883) (owner: 10Jayprakash12345) [18:10:36] RoanKattouw, did you find anything? [18:10:49] Everything seems to be fine [18:10:54] RoanKattouw, great. [18:11:06] I don't have import rights either so I can't actually go to Special:Import [18:11:52] (03Merged) 10jenkins-bot: Enable mapframe on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420077 (https://phabricator.wikimedia.org/T189883) (owner: 10Jayprakash12345) [18:14:27] RoanKattouw: Patch https://gerrit.wikimedia.org/r/#/c/420077/ working fine [18:14:36] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable mapframe on knwiki (T189883) (duration: 00m 58s) [18:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:42] T189883: Enable on kannada wikipedia - https://phabricator.wikimedia.org/T189883 [18:16:14] (03PS7) 10Catrope: Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [18:16:19] (03CR) 10Catrope: [C: 032] Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [18:17:38] (03Merged) 10jenkins-bot: Switch public wikis to 
explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [18:19:10] RECOVERY - Request latencies on acrux is OK: OK - apiserver_request_latencies is 4189 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:19:50] RECOVERY - Request latencies on acrab is OK: OK - apiserver_request_latencies is 5480 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:20:28] (03PS1) 10Framawiki: New throttling exception for 2018-03-22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420397 (https://phabricator.wikimedia.org/T189796) [18:24:59] (03CR) 10Framawiki: New throttle rule (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420122 (https://phabricator.wikimedia.org/T189778) (owner: 10Urbanecm) [18:27:30] !log catrope@tin Synchronized dblists/: Uninstall Flow from wikis where it was never used (T188812) (duration: 00m 57s) [18:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:35] T188812: Uninstall Flow on all wikis where it has zero topics - https://phabricator.wikimedia.org/T188812 [18:27:51] PROBLEM - Request latencies on acrab is CRITICAL: CRITICAL - apiserver_request_latencies is 21660230 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:28:30] 10Operations, 10Ops-Access-Requests, 10Release-Engineering-Team (Kanban), 10User-greg: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4062113 (10greg) a:05greg>03RobH Approved. 
[18:28:32] (03PS2) 10Catrope: Enable $wgFlowReadOnly on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420131 (https://phabricator.wikimedia.org/T186463) [18:28:52] (03CR) 10Catrope: [C: 032] Enable $wgFlowReadOnly on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420131 (https://phabricator.wikimedia.org/T186463) (owner: 10Catrope) [18:29:00] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 42111774 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:29:10] PROBLEM - Request latencies on acrux is CRITICAL: CRITICAL - apiserver_request_latencies is 49747295 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:29:21] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 24831642 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:29:44] 10Operations, 10Ops-Access-Requests: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4062116 (10greg) [18:29:51] RECOVERY - Request latencies on acrab is OK: OK - apiserver_request_latencies is 5429 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:30:01] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 5897 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:30:04] (03Merged) 10jenkins-bot: Enable $wgFlowReadOnly on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420131 (https://phabricator.wikimedia.org/T186463) (owner: 10Catrope) [18:30:10] RECOVERY - Request latencies on acrux is OK: OK - apiserver_request_latencies is 4199 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:30:21] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 3908 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:32:20] (03CR) 10Zoranzoki21: [C: 031] Set $wgUploadNavigationUrl for few wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/364121 (https://phabricator.wikimedia.org/T170083) (owner: 10Framawiki) [18:33:12] !log eevans@tin Started deploy [restbase/deploy@8dbc93c] (dev-cluster): bring dev environment current w/ production (T186751) [18:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:18] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [18:33:58] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable $wgFlowReadOnly on commonswiki (T186463) (duration: 00m 57s) [18:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:04] T186463: Uninstall Flow from Commons - https://phabricator.wikimedia.org/T186463 [18:34:11] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [18:37:28] 10Operations: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081#4062129 (10RobH) p:05Triage>03Normal [18:38:10] 10Operations: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081#4062149 (10RobH) a:03fgiunchedi I'd like to get @fgiunchedi's sign off on our racking proposal, since it affects when the new systems will use their 10G interfaces versus 1G interfaces. 
[18:41:15] (03CR) 10Mobrovac: "LGTM, one minor comment re the number of CPUs" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/420305 (owner: 10Alexandros Kosiaris) [18:42:35] !log smalyshev@tin Started deploy [wdqs/wdqs@d6bc746]: GUI update [18:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:14] (03CR) 10jenkins-bot: Add more import sources to mrwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419996 (https://phabricator.wikimedia.org/T188486) (owner: 10Urbanecm) [18:43:19] (03CR) 10jenkins-bot: Enable mapframe on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420077 (https://phabricator.wikimedia.org/T189883) (owner: 10Jayprakash12345) [18:43:23] (03CR) 10jenkins-bot: Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [18:43:28] !log eevans@tin Finished deploy [restbase/deploy@8dbc93c] (dev-cluster): bring dev environment current w/ production (T186751) (duration: 10m 16s) [18:43:29] (03CR) 10jenkins-bot: Enable $wgFlowReadOnly on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420131 (https://phabricator.wikimedia.org/T186463) (owner: 10Catrope) [18:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:35] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [18:44:59] !log smalyshev@tin Finished deploy [wdqs/wdqs@d6bc746]: GUI update (duration: 02m 24s) [18:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:37] 10Operations, 10ops-eqiad: setup backup1001.eqiad.wmnet - https://phabricator.wikimedia.org/T189801#4062174 (10Cmjohnson) [18:49:32] 10Operations, 10ops-eqiad: setup backup1001.eqiad.wmnet - https://phabricator.wikimedia.org/T189801#4053902 (10Cmjohnson) Moved this server to u11 on A8 once @akosiaris and I figure out a 
day/time to make the move I will relocate the helium array to u 9/10. [18:51:57] (03PS2) 10Dzahn: DNS: Add mgmt DNS entries for mw2259-mw2290 [dns] - 10https://gerrit.wikimedia.org/r/420372 (owner: 10Papaul) [18:52:38] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt DNS entries for mw2259-mw2290 [dns] - 10https://gerrit.wikimedia.org/r/420372 (owner: 10Papaul) [18:55:22] 10Operations: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081#4062196 (10RobH) [18:57:01] (03CR) 10Dzahn: [C: 032] "langcom approved https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Gorontalo" [dns] - 10https://gerrit.wikimedia.org/r/416929 (https://phabricator.wikimedia.org/T189109) (owner: 10Urbanecm) [18:57:23] (03PS6) 10Dzahn: Add gor to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/416929 (https://phabricator.wikimedia.org/T189109) (owner: 10Urbanecm) [19:00:33] !log adding gor.wikipedia.org - new language Gorontalo https://www.ethnologue.com/language/gor | https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Gorontalo [19:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:35] !log DNS - authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsd checkconf && gdnsd reload-zones on ns servers to recreate zone files to add new language "gor" to langs.tmpl (T189109) [19:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:41] T189109: Create Wikipedia Gorontalo - https://phabricator.wikimedia.org/T189109 [19:03:40] Reedy or RoanKattouw or someone else, I could use a hand with a (possibly harmless) config issue on wikitech: https://phabricator.wikimedia.org/T189347 [19:04:02] Current theory: I'm not using the federated jobqueue, but something elsewhere in the config thinks I am. 
[19:04:16] AaronSchulz would probably know best offhand [19:04:22] Without me looking (I can in a few) [19:05:00] I'll take whoever I can get :) [19:05:10] No idea what that's about [19:05:15] Yes Aaron would be the best person to ask [19:05:44] ok, thanks Roan [19:05:48] !log eevans@tin Started deploy [restbase/deploy@8dbc93c] (dev-cluster): update dev environment to current production (T186751) [19:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:54] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [19:06:03] andrewbogott: It's also quite possible you've found some weird edge case bug [19:06:51] Let me make a drink, and I'll have a poke/look [19:06:55] Reedy: that seems likely [19:07:43] (03CR) 10Gilles: navtiming.py: Make sure to record country specific when oversampling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [19:07:56] 10Operations: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081#4062213 (10RobH) [19:10:11] PROBLEM - Request latencies on acrux is CRITICAL: CRITICAL - apiserver_request_latencies is 25446500 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:11:11] RECOVERY - Request latencies on acrux is OK: OK - apiserver_request_latencies is 4209 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:12:12] 10Operations, 10ops-eqiad: WMF4727 hardware issue - disks dont detect in installer - https://phabricator.wikimedia.org/T189804#4062229 (10Dzahn) Can/Should i just try to install this one more time, as bast1003, @RobH, or are you already on it? 
[19:12:14] (03CR) 10BryanDavis: [C: 031] toolsdb: include failsafe against removing admin accounts [puppet] - 10https://gerrit.wikimedia.org/r/420114 (https://phabricator.wikimedia.org/T188680) (owner: 10Bstorm) [19:12:44] 10Operations, 10ops-eqiad: WMF4727 hardware issue - disks dont detect in installer - https://phabricator.wikimedia.org/T189804#4062230 (10Dzahn) Its MAC should still be in install_server config. [19:14:17] !log eevans@tin Finished deploy [restbase/deploy@8dbc93c] (dev-cluster): update dev environment to current production (T186751) (duration: 08m 30s) [19:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:23] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [19:15:13] 10Operations, 10ops-eqiad: WMF4727 hardware issue - disks dont detect in installer - https://phabricator.wikimedia.org/T189804#4062241 (10Dzahn) a:05Cmjohnson>03Dzahn [19:16:36] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4062249 (10RobH) Chris swapped the bad disk out this AM: ``` Record: 6 Date/Time: 03/14/2018 14:35:50 So... [19:19:08] andrewbogott: Where is wikitech running from now? Can I still ssh onto it (like I did on silver)? [19:19:21] Reedy: labweb1001 and labweb1002 [19:19:28] and you should be able to connect as a deployer, same as on silver [19:20:35] Reedy: btw, silver still had some crons running that were spamming those logs, I think I've cleaned that up now [19:20:40] > var_dump( $wmgUseClusterJobqueue ); [19:20:40] bool(false) [19:21:21] So it shouldn't be including jobqueue.php [19:22:20] But then... This shouldn't have changed from silver [19:23:47] andrewbogott: so... 
[19:23:53] $wgCdnReboundPurgeDelay is 11 on labweb1001 [19:23:56] It's 0 on silver [19:24:03] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4062289 (10Cmjohnson) @ayounsi FYI I deleted 2 interfaces from b4 cmjohnson@asw-b-eqiad# show |compare [edit interfaces] - ge-4/0/23 { - descri... [19:24:18] Reedy: that would do it. That must just be a default I inherited/copied from someplace? [19:24:25] Probably [19:24:45] !log upgraded compiler03.puppet3-diffs.eqiad.wmflabs (depooled) to puppetdb4/postgres backend [19:24:46] Default is 0 in DefaultSettings [19:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:58] reverse-proxy-miscweb.php [19:24:58] Is it in your new config on labweb100[12]? [19:25:19] !log mobrovac@tin (no justification provided) [19:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:25] (03PS11) 10Imarlier: navtiming.py: Make sure to record country specific when oversampling [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) [19:25:39] this is weird ^ [19:25:51] (03CR) 10Imarlier: navtiming.py: Make sure to record country specific when oversampling (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [19:25:54] wmgUseClusterSquid is false on labweb though [19:26:15] Reedy: it's because the wikitech-specific config has "require "$wmfConfigDir/reverse-proxy.php";" [19:26:20] Not sure why yet [19:26:21] ah [19:27:09] it actually looks like that include is probably correct... [19:27:27] https://github.com/wikimedia/operations-mediawiki-config/commit/1c664c757c014f0e386cb1acf9706079c33c1d8c [19:27:33] I can just override wgCdnReboundPurgeDelay in the next line unless you think that's harmful [19:27:48] I don't think so... 
[19:27:51] oh [19:27:57] I don't think it's harmful, that is [19:28:18] Just need to remember to remove it when wikitech eventually becomes a cluster wiki :) [19:28:31] remove that and a whole lot of other things [19:28:39] heh, indeed [19:29:25] (03PS4) 10Hoo man: Fix killing dumpers in Wikidata entity dumpers [puppet] - 10https://gerrit.wikimedia.org/r/393923 [19:29:49] (03PS1) 10Andrew Bogott: wikitech: set wgCdnReboundPurgeDelay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420406 (https://phabricator.wikimedia.org/T189347) [19:29:54] Reedy: ^ [19:30:04] (03PS1) 10Cmjohnson: Removing decom'd host mgmt dns [dns] - 10https://gerrit.wikimedia.org/r/420407 (https://phabricator.wikimedia.org/T187446) [19:30:17] (03CR) 10Reedy: [C: 031] wikitech: set wgCdnReboundPurgeDelay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420406 (https://phabricator.wikimedia.org/T189347) (owner: 10Andrew Bogott) [19:30:38] (03CR) 10jerkins-bot: [V: 04-1] wikitech: set wgCdnReboundPurgeDelay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420406 (https://phabricator.wikimedia.org/T189347) (owner: 10Andrew Bogott) [19:30:41] (03CR) 10Cmjohnson: [C: 032] Removing decom'd host mgmt dns [dns] - 10https://gerrit.wikimedia.org/r/420407 (https://phabricator.wikimedia.org/T187446) (owner: 10Cmjohnson) [19:31:13] Reedy: looks like we're between deploys so I'll merge that right now [19:31:26] lol [19:31:41] (03PS2) 10Reedy: wikitech: set wgCdnReboundPurgeDelay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420406 (https://phabricator.wikimedia.org/T189347) (owner: 10Andrew Bogott) [19:31:43] fixed :P [19:31:52] !log eevans@tin Started deploy [restbase/deploy@8dbc93c] (dev-cluster): update dev environment to current production (T186751) [19:31:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: Decommission xenon, cerium, praseodymium - https://phabricator.wikimedia.org/T187446#4062299 (10Cmjohnson) [19:31:57] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:58] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [19:32:39] I will never be able to write working python and php on the same day [19:32:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, and 3 others: Decommission restbase-test environment - https://phabricator.wikimedia.org/T186755#4062303 (10Cmjohnson) [19:32:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: Decommission xenon, cerium, praseodymium - https://phabricator.wikimedia.org/T187446#3975453 (10Cmjohnson) 05Open>03Resolved 2 of the 3 systems had ssds, the ssds were removed and replaced with wiped 500GB SATA disks. Resol... [19:32:56] (03CR) 10jerkins-bot: [V: 04-1] wikitech: set wgCdnReboundPurgeDelay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420406 (https://phabricator.wikimedia.org/T189347) (owner: 10Andrew Bogott) [19:33:39] (03PS3) 10Reedy: wikitech: set wgCdnReboundPurgeDelay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420406 (https://phabricator.wikimedia.org/T189347) (owner: 10Andrew Bogott) [19:33:43] (03CR) 10Gilles: [C: 031] navtiming.py: Make sure to record country specific when oversampling [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [19:34:07] It should always take at least 5 patchsets for a one-liner :( [19:35:22] (03PS1) 10Madhuvishy: slow-parse: Turn off rsync from mwlog1001 to dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/420408 (https://phabricator.wikimedia.org/T189284) [19:35:26] (03CR) 10Andrew Bogott: [C: 032] wikitech: set wgCdnReboundPurgeDelay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420406 (https://phabricator.wikimedia.org/T189347) (owner: 10Andrew Bogott) [19:35:51] (03CR) 10jerkins-bot: [V: 04-1] slow-parse: Turn off rsync from mwlog1001 to dumps servers [puppet] - 
10https://gerrit.wikimedia.org/r/420408 (https://phabricator.wikimedia.org/T189284) (owner: 10Madhuvishy) [19:36:13] (03PS1) 10Awight: Install git-lfs on scap source and target [puppet] - 10https://gerrit.wikimedia.org/r/420409 (https://phabricator.wikimedia.org/T180628) [19:36:49] 10Operations, 10hardware-requests: Reclaim/Decommission Silver.wikimedia.org - https://phabricator.wikimedia.org/T190085#4062308 (10Andrew) [19:36:51] (03Merged) 10jenkins-bot: wikitech: set wgCdnReboundPurgeDelay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420406 (https://phabricator.wikimedia.org/T189347) (owner: 10Andrew Bogott) [19:36:53] (03PS2) 10Madhuvishy: slow-parse: Turn off rsync from mwlog1001 to dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/420408 (https://phabricator.wikimedia.org/T189284) [19:37:10] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4062317 (10awight) The catch seems to be that tin and deployment-tin are running jessie, so the git-lfs package isn't easily available. 
[19:37:23] 10Operations, 10hardware-requests: Reclaim/Decommission Silver.wikimedia.org - https://phabricator.wikimedia.org/T190085#4062308 (10Andrew) [19:37:29] 10Operations, 10Cloud-Services, 10hardware-requests, 10Patch-For-Review, 10cloud-services-team (Kanban): decom silver (was silver has trouble rebooting) - https://phabricator.wikimedia.org/T168559#4062322 (10Andrew) [19:37:53] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#3764041 (10Dzahn) tin will be replaced very soon [19:38:52] !log andrew@tin Synchronized wmf-config/wikitech.php: fix for T189347 (duration: 00m 57s) [19:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:57] T189347: Possible issues with varnish purging on wikitech - https://phabricator.wikimedia.org/T189347 [19:40:31] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#4062334 (10Dzahn) a:05RobH>03Dzahn Chris also replaced disks in this one today. 
[19:41:51] (03CR) 10jenkins-bot: wikitech: set wgCdnReboundPurgeDelay to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420406 (https://phabricator.wikimedia.org/T189347) (owner: 10Andrew Bogott) [19:42:20] !log eevans@tin Finished deploy [restbase/deploy@8dbc93c] (dev-cluster): update dev environment to current production (T186751) (duration: 10m 28s) [19:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:28] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [19:42:43] (03PS1) 10Madhuvishy: slowparse: Remove code for rsync to dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/420410 (https://phabricator.wikimedia.org/T189284) [19:44:16] (03PS1) 10Madhuvishy: dumps: Absent slowparse logs rsync config [puppet] - 10https://gerrit.wikimedia.org/r/420411 (https://phabricator.wikimedia.org/T189284) [19:48:56] (03PS1) 10Madhuvishy: dumps: Remove slowparse rsync related code [puppet] - 10https://gerrit.wikimedia.org/r/420415 (https://phabricator.wikimedia.org/T189284) [19:53:37] !log andrew@tin Synchronized wmf-config/wikitech.php: fix for T189347 take 2 (duration: 00m 57s) [19:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:43] T189347: Possible issues with varnish purging on wikitech - https://phabricator.wikimedia.org/T189347 [19:53:53] 10Operations, 10hardware-requests: Decommission old server wmf4077 - https://phabricator.wikimedia.org/T190086#4062397 (10Peachey88) [19:54:46] Reedy: that change (now that it's actually applied) seems to have quieted down the log. Thank you! 
[19:54:52] yay [19:56:13] (03PS1) 10Mobrovac: RESTBase: Add the correct seeds for the dev environment [puppet] - 10https://gerrit.wikimedia.org/r/420416 (https://phabricator.wikimedia.org/T186751) [19:56:30] 10Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#4062410 (10awight) [19:56:35] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4062409 (10awight) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180319T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:17] No ORES fun today. [20:01:59] (03CR) 10Mobrovac: "PCC ok - https://puppet-compiler.wmflabs.org/compiler02/10514/restbase-dev1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/420416 (https://phabricator.wikimedia.org/T186751) (owner: 10Mobrovac) [20:07:25] 10Operations, 10Traffic, 10Beta-Cluster-reproducible, 10Performance-Team (Radar): PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#4062444 (10Krinkle) [20:08:00] (03PS1) 10Papaul: DNS: Add production DNS entries for mw2259-mw2290 [dns] - 10https://gerrit.wikimedia.org/r/420425 [20:11:51] (03PS19) 10Paladox: Phabricator: Support php 7.2 under stretch [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [20:13:33] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4062460 (10ayounsi) >>! In T183585#4062289, @Cmjohnson wrote: > @ayounsi FYI I deleted 2 interfaces from b4 asw2-b updated. 
[20:14:23] !log discarding unused vcl on all cp backends, 1-at-a-time [20:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:51] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4062475 (10Papaul) @joe @MoritzMuehlenhoff the last mw server mw2258 has Jessie installed on it. Are we doing Stretch on the new ones or keep installing Jessie? [20:19:02] (03PS2) 10BBlack: Add more sleep delay on varnish restarts [puppet] - 10https://gerrit.wikimedia.org/r/420388 (https://phabricator.wikimedia.org/T189892) [20:19:04] (03PS2) 10BBlack: bump vcl reload delay by 3.6s [puppet] - 10https://gerrit.wikimedia.org/r/420389 (https://phabricator.wikimedia.org/T189892) [20:19:06] (03PS2) 10BBlack: Increase varnish probe interval to 1s [puppet] - 10https://gerrit.wikimedia.org/r/420390 (https://phabricator.wikimedia.org/T189892) [20:19:08] (03PS1) 10BBlack: auto-discard vcls when reloading [puppet] - 10https://gerrit.wikimedia.org/r/420432 (https://phabricator.wikimedia.org/T189892) [20:19:59] !log discarding unused vcl on all cp frontends, 1-at-a-time [20:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:07] (03PS4) 10Bstorm: toolsdb: include failsafe against removing admin accounts [puppet] - 10https://gerrit.wikimedia.org/r/420114 (https://phabricator.wikimedia.org/T188680) [20:21:14] (03CR) 10BBlack: [C: 032] auto-discard vcls when reloading [puppet] - 10https://gerrit.wikimedia.org/r/420432 (https://phabricator.wikimedia.org/T189892) (owner: 10BBlack) [20:21:18] (03CR) 10BBlack: [C: 032] Add more sleep delay on varnish restarts [puppet] - 10https://gerrit.wikimedia.org/r/420388 (https://phabricator.wikimedia.org/T189892) (owner: 10BBlack) [20:21:27] (03CR) 10Bstorm: [C: 032] toolsdb: include failsafe against removing admin accounts [puppet] - 10https://gerrit.wikimedia.org/r/420114 
(https://phabricator.wikimedia.org/T188680) (owner: 10Bstorm) [20:21:29] (03CR) 10BBlack: [C: 032] bump vcl reload delay by 3.6s [puppet] - 10https://gerrit.wikimedia.org/r/420389 (https://phabricator.wikimedia.org/T189892) (owner: 10BBlack) [20:22:11] hmmm [20:22:26] we may have hit an edge case with "submit-including-parents" on my part, bstorm_ [20:22:55] bstorm_: I think yours is already merged on the puppetmaster, right? [20:22:57] Oh? [20:23:06] (03PS2) 10BBlack: auto-discard vcls when reloading [puppet] - 10https://gerrit.wikimedia.org/r/420432 (https://phabricator.wikimedia.org/T189892) [20:23:07] I just puppet-merged [20:23:18] I only saw my commit, though [20:23:27] ok, I can fix mine then, thanks! [20:23:34] (03CR) 10BBlack: [V: 032 C: 032] auto-discard vcls when reloading [puppet] - 10https://gerrit.wikimedia.org/r/420432 (https://phabricator.wikimedia.org/T189892) (owner: 10BBlack) [20:23:43] Ok, sorry if I caused a problem with timing! [20:23:53] (03PS3) 10BBlack: Add more sleep delay on varnish restarts [puppet] - 10https://gerrit.wikimedia.org/r/420388 (https://phabricator.wikimedia.org/T189892) [20:24:07] it's unavoidable so long as we're all working in parallel :) [20:24:11] (03CR) 10BBlack: [V: 032 C: 032] Add more sleep delay on varnish restarts [puppet] - 10https://gerrit.wikimedia.org/r/420388 (https://phabricator.wikimedia.org/T189892) (owner: 10BBlack) [20:24:25] (03PS3) 10BBlack: bump vcl reload delay by 3.6s [puppet] - 10https://gerrit.wikimedia.org/r/420389 (https://phabricator.wikimedia.org/T189892) [20:24:28] (03CR) 10BBlack: [V: 032 C: 032] bump vcl reload delay by 3.6s [puppet] - 10https://gerrit.wikimedia.org/r/420389 (https://phabricator.wikimedia.org/T189892) (owner: 10BBlack) [20:26:42] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#4062499 (10Dzahn) I started the install process again and it went through partitioning and installing the base system, so we can 
confirm the hard disks themselves work. That being said, the installati... [20:28:35] (03PS1) 10Rush: openstack: ml2 and linuxbridge_agent setup [puppet] - 10https://gerrit.wikimedia.org/r/420433 (https://phabricator.wikimedia.org/T188266) [20:30:15] (03PS2) 10Rush: openstack: ml2 and linuxbridge_agent setup [puppet] - 10https://gerrit.wikimedia.org/r/420433 (https://phabricator.wikimedia.org/T188266) [20:30:45] (03PS3) 10BBlack: Increase varnish probe interval to 1s [puppet] - 10https://gerrit.wikimedia.org/r/420390 (https://phabricator.wikimedia.org/T189892) [20:32:06] !log signing puppet certs for new host bast1002. initial puppet run, will replace bast1001 soon (T186623) [20:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:12] T186623: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623 [20:32:52] (03PS3) 10Rush: openstack: ml2 and linuxbridge_agent setup [puppet] - 10https://gerrit.wikimedia.org/r/420433 (https://phabricator.wikimedia.org/T188266) [20:33:45] (03CR) 10BBlack: [C: 032] Increase varnish probe interval to 1s [puppet] - 10https://gerrit.wikimedia.org/r/420390 (https://phabricator.wikimedia.org/T189892) (owner: 10BBlack) [20:33:47] (03PS2) 10Dzahn: site: turn bast1002 into a bastion host [puppet] - 10https://gerrit.wikimedia.org/r/414848 (https://phabricator.wikimedia.org/T186623) [20:35:30] 10Operations, 10Puppet, 10Release-Engineering-Team: puppetdb4: use postgres db backend in puppet-compiler - https://phabricator.wikimedia.org/T187258#4062512 (10herron) Compiler02 has been upgraded to puppetdb 4 and populate-puppetdb kicked off. The first few dozen hosts have compiled/populated successfully... 
[20:36:52] (03PS4) 10Rush: openstack: ml2 and linuxbridge_agent setup [puppet] - 10https://gerrit.wikimedia.org/r/420433 (https://phabricator.wikimedia.org/T188266) [20:42:22] (03Abandoned) 10Madhuvishy: uwsgi: Allow specifying plugins as a uwsgi command line option [puppet] - 10https://gerrit.wikimedia.org/r/292030 (owner: 10Madhuvishy) [20:42:53] (03PS5) 10Rush: openstack: ml2 and linuxbridge_agent setup [puppet] - 10https://gerrit.wikimedia.org/r/420433 (https://phabricator.wikimedia.org/T188266) [20:45:58] (03PS6) 10Rush: openstack: ml2 and linuxbridge_agent setup [puppet] - 10https://gerrit.wikimedia.org/r/420433 (https://phabricator.wikimedia.org/T188266) [20:47:12] 10Operations, 10Traffic, 10netops: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4062522 (10ayounsi) p:05Triage>03Normal [20:49:36] (03CR) 10Rush: "labtestcontrol2003.wikimedia.org,labneutron2001.codfw.wmnet,labtestvirt2003.codfw.wmnet,labtestmetal2001.codfw.wmnet,labneutron2002.codfw." [puppet] - 10https://gerrit.wikimedia.org/r/420433 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [20:49:54] (03PS7) 10Rush: openstack: ml2 and linuxbridge_agent setup [puppet] - 10https://gerrit.wikimedia.org/r/420433 (https://phabricator.wikimedia.org/T188266) [20:58:51] (03CR) 10Rush: [C: 032] openstack: ml2 and linuxbridge_agent setup [puppet] - 10https://gerrit.wikimedia.org/r/420433 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:00:04] bawolff and Reedy: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180319T2100). [21:00:04] No GERRIT patches in the queue for this window AFAICS. 
[21:01:23] (03PS3) 10Dzahn: site: turn bast1002 into a bastion host [puppet] - 10https://gerrit.wikimedia.org/r/414848 (https://phabricator.wikimedia.org/T186623) [21:01:38] (03CR) 10Dzahn: [C: 032] site: turn bast1002 into a bastion host [puppet] - 10https://gerrit.wikimedia.org/r/414848 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn) [21:04:36] (03CR) 10Krinkle: [C: 031] "Confirmed. WikidataOrg.php basically just calls wfLoadExtension( 'Wikidata.org' ) now. And this was merged in November 2017, so should be " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395488 (owner: 10Addshore) [21:05:26] RECOVERY - Check systemd state on labtestmetal2001 is OK: OK - running: The system is fully operational [21:06:33] anyone mind if I add in two patches for the swat? [21:07:42] (03CR) 10Krinkle: partman: fix recipes for bastion servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/414882 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn) [21:08:04] mutante: just double-checking, feel free to ignore ^ :) [21:09:48] Reedy? [21:10:06] I'm not SWAT-ing... [21:12:20] huh [21:12:20] jouncebot seems to think you are? [21:12:24] 1:00:06 PM <+jouncebot> bawolff and Reedy: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180319T2100). [21:12:34] No, it's not a swat window [21:13:02] Krinkle: yea, i think it was intended.. but your comment did make me question it again.. 
i'll check with others [21:13:08] oh, silly me [21:13:27] jouncebot: now [21:13:27] For the next 1 hour(s) and 46 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180319T2100) [21:13:35] jouncebot: next [21:13:35] In 1 hour(s) and 46 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180319T2300) [21:13:56] That's the window you want, Cupid.^ [21:15:17] ah, thanks [21:23:02] * Krinkle is doing manual tests on mwdebug1001 unrelated to SWAT [21:31:05] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:31:26] PROBLEM - puppet last run on labvirt1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:31:35] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:31:35] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:31:45] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:31:55] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:32:05] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:32:15] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:32:55] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:32:55] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:33:05] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:33:05] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:33:16] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:33:16] PROBLEM - puppet last run on wtp1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:33:16] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:33:36] yes, puppetdb. it got auto-restarted 5 min. yes, it will recover [21:35:05] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:35:25] (03PS1) 10Krinkle: mediawiki: Fix preg_match bug in furl causing bad redirect and E_NOTICE [puppet] - 10https://gerrit.wikimedia.org/r/420601 [21:36:50] (03PS2) 10Krinkle: mediawiki: Fix preg_match bug in furl causing bad redirect and E_NOTICE [puppet] - 10https://gerrit.wikimedia.org/r/420601 [21:37:26] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#4062668 (10Dzahn) [21:38:16] RECOVERY - puppet last run on wtp1026 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:38:36] 10Operations, 10hardware-requests: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#4062671 (10Dzahn) [21:38:41] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#3949728 (10Dzahn) 05Open>03Resolved handed-over to self ;) [21:41:40] (03CR) 10Krinkle: [C: 031] slow-parse: Turn off rsync from mwlog1001 to dumps servers 
[puppet] - 10https://gerrit.wikimedia.org/r/420408 (https://phabricator.wikimedia.org/T189284) (owner: 10Madhuvishy) [21:42:00] (03PS2) 10Krinkle: slow-parse: Remove code for rsync to dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/420410 (https://phabricator.wikimedia.org/T189284) (owner: 10Madhuvishy) [21:42:03] (03CR) 10Krinkle: [C: 031] slow-parse: Remove code for rsync to dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/420410 (https://phabricator.wikimedia.org/T189284) (owner: 10Madhuvishy) [21:42:15] (03CR) 10Krinkle: [C: 031] dumps: Absent slowparse logs rsync config [puppet] - 10https://gerrit.wikimedia.org/r/420411 (https://phabricator.wikimedia.org/T189284) (owner: 10Madhuvishy) [21:42:43] (03CR) 10Krinkle: [C: 031] dumps: Remove slowparse rsync related code [puppet] - 10https://gerrit.wikimedia.org/r/420415 (https://phabricator.wikimedia.org/T189284) (owner: 10Madhuvishy) [21:43:03] 10Operations: replace bast1001 (new hardware) - https://phabricator.wikimedia.org/T183412#4062707 (10Dzahn) unblocked now, OS install was now succesful after disks have been changed [21:46:45] (03CR) 10Eevans: [C: 031] "Not production; Only applicable to the dev environment (which is already down)." [puppet] - 10https://gerrit.wikimedia.org/r/420416 (https://phabricator.wikimedia.org/T186751) (owner: 10Mobrovac) [21:49:49] 10Operations, 10netops: Config discrepencies on network devices - https://phabricator.wikimedia.org/T189588#4062787 (10ayounsi) [21:53:01] (03Abandoned) 10Hashar: rpc: raise exception instead of die [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349185 (owner: 10Hashar) [21:53:37] (03Abandoned) 10Hashar: (WIP) Puppet compile an host via rspec (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/308889 (owner: 10Hashar) [21:54:49] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4062832 (10MoritzMuehlenhoff) >>! 
In T188301#4062475, @Papaul wrote: > @joe @MoritzMuehlenhoff the last mw server me2258 has Jessie installed on it. Are we doing Stretc... [21:58:16] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [22:00:02] (03PS2) 10Hashar: admin: contint-admins to restart Jenkins via systemd [puppet] - 10https://gerrit.wikimedia.org/r/408555 [22:00:05] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:01:05] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:01:25] RECOVERY - puppet last run on labvirt1015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:01:36] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:01:36] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:01:45] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:01:55] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:02:05] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:02:15] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:02:22] (03CR) 10Hashar: [C: 031] "Looks like we forgot about this change. That is to let contint-admins members to restart jenkins via systemctl." 
[puppet] - 10https://gerrit.wikimedia.org/r/408555 (owner: 10Hashar) [22:02:44] (03CR) 1020after4: [C: 031] mediawiki: Fix preg_match bug in furl causing bad redirect and E_NOTICE [puppet] - 10https://gerrit.wikimedia.org/r/420601 (owner: 10Krinkle) [22:02:55] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:02:55] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:03:05] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:03:05] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:03:16] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:03:39] (03Abandoned) 10Hashar: prometheus: make ferm DNS record type configurable [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T153468) (owner: 10Hashar) [22:03:59] (03Abandoned) 10Hashar: Fix nrpe spec for os_version() [puppet] - 10https://gerrit.wikimedia.org/r/419410 (owner: 10Hashar) [22:09:08] (03CR) 10Dzahn: [C: 032] partman: fix recipes for bastion servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/414882 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn) [22:13:27] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Someday): Add support for stretch in the phabricator puppet class - https://phabricator.wikimedia.org/T187127#4062894 (10Dzahn) 05Open>03Resolved a:03Dzahn Thank you for your work on this, Paladox. I'll call it resolved the... 
[22:13:31] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 2 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#4062897 (10Dzahn) [22:13:35] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Someday): Add support for stretch in the phabricator puppet class - https://phabricator.wikimedia.org/T187127#4062899 (10Dzahn) a:05Dzahn>03Paladox [22:14:16] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Someday): Add support for stretch in the phabricator puppet class - https://phabricator.wikimedia.org/T187127#4062905 (10Paladox) @Dzahn though the change wasen't merged yet https://gerrit.wikimedia.org/r/#/c/410245/ [22:14:23] (03PS20) 10Paladox: Phabricator: Support php 7.2 under stretch [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [22:14:58] (03PS5) 10Dzahn: microsites::design: enable cloning from 2 new repos [puppet] - 10https://gerrit.wikimedia.org/r/415748 (https://phabricator.wikimedia.org/T185282) [22:15:24] (03CR) 10Dzahn: [C: 032] microsites::design: enable cloning from 2 new repos [puppet] - 10https://gerrit.wikimedia.org/r/415748 (https://phabricator.wikimedia.org/T185282) (owner: 10Dzahn) [22:18:10] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Someday): Add support for stretch in the phabricator puppet class - https://phabricator.wikimedia.org/T187127#4062926 (10Dzahn) Eh.. 
then please define "stretch support is working" [22:18:14] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Someday): Add support for stretch in the phabricator puppet class - https://phabricator.wikimedia.org/T187127#4062927 (10Dzahn) 05Resolved>03Open [22:18:18] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 2 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#4062928 (10Dzahn) [22:19:09] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Someday): Add support for stretch in the phabricator puppet class - https://phabricator.wikimedia.org/T187127#4062930 (10Paladox) @Dzahn i cherry picked this https://gerrit.wikimedia.org/r/#/c/410245/ onto my local puppet master a... [22:28:02] (03CR) 1020after4: [C: 031] "Looks good, I haven't tested it but the changes look sane" [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) (owner: 10Paladox) [22:30:52] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4062972 (10Dzahn) a:03Dzahn [22:31:05] (03CR) 1020after4: [C: 031] "http://puppet-compiler.wmflabs.org/10518/ looks good as well" [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) (owner: 10Paladox) [22:31:11] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (10Dzahn) p:05Normal>03High [22:31:16] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (10Dzahn) 05stalled>03Open [22:33:21] (03PS1) 
10Dzahn: misc::cache: add design.wikimedia.org req_handling rule [puppet] - 10https://gerrit.wikimedia.org/r/420604 (https://phabricator.wikimedia.org/T185282) [22:34:47] (03CR) 10Dzahn: [C: 032] misc::cache: add design.wikimedia.org req_handling rule [puppet] - 10https://gerrit.wikimedia.org/r/420604 (https://phabricator.wikimedia.org/T185282) (owner: 10Dzahn) [22:36:53] (03PS2) 10Dzahn: [Italian Planet] Add User:Sciking blog to it.planet.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/420243 (owner: 10Nemo bis) [22:37:10] (03CR) 10Dzahn: [C: 032] [Italian Planet] Add User:Sciking blog to it.planet.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/420243 (owner: 10Nemo bis) [22:38:27] (03PS1) 10Ayounsi: Add deploy script to link config with where netbox expects it [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/420605 [22:38:31] (03PS2) 10Dzahn: RESTBase: Add the correct seeds for the dev environment [puppet] - 10https://gerrit.wikimedia.org/r/420416 (https://phabricator.wikimedia.org/T186751) (owner: 10Mobrovac) [22:38:42] (03CR) 10Dzahn: [C: 032] "dev-only" [puppet] - 10https://gerrit.wikimedia.org/r/420416 (https://phabricator.wikimedia.org/T186751) (owner: 10Mobrovac) [22:39:50] (03CR) 10Dzahn: [C: 032] "merged but not applied since restbase-dev has disabled puppet with reason rebuilding cluster" [puppet] - 10https://gerrit.wikimedia.org/r/420416 (https://phabricator.wikimedia.org/T186751) (owner: 10Mobrovac) [22:40:43] (03PS1) 10Ayounsi: Netbox: Move configuration to /etc/ [puppet] - 10https://gerrit.wikimedia.org/r/420607 [22:41:03] (03PS2) 10Ayounsi: Netbox: Move configuration to /etc/ [puppet] - 10https://gerrit.wikimedia.org/r/420607 [22:47:08] (03PS2) 1020after4: Bump scap package version to 3.7.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/417943 (https://phabricator.wikimedia.org/T189306) [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your 
horoscope predicts another unfortunate Evening SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180319T2300). [23:00:04] tgr and Cupid: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:15:50] (03PS1) 10Dzahn: design.wm.org: add Apache alias config for style-guide [puppet] - 10https://gerrit.wikimedia.org/r/420612 (https://phabricator.wikimedia.org/T185282) [23:16:23] (03PS2) 10Dzahn: design.wm.org: add Apache alias config for style-guide [puppet] - 10https://gerrit.wikimedia.org/r/420612 (https://phabricator.wikimedia.org/T185282) [23:16:47] (03CR) 10Dzahn: [C: 032] "tested on bromine" [puppet] - 10https://gerrit.wikimedia.org/r/420612 (https://phabricator.wikimedia.org/T185282) (owner: 10Dzahn) [23:20:33] (03CR) 10Ayounsi: [C: 032] Netbox: Move configuration to /etc/ [puppet] - 10https://gerrit.wikimedia.org/r/420607 (owner: 10Ayounsi) [23:20:35] (03PS3) 10Ayounsi: Netbox: Move configuration to /etc/ [puppet] - 10https://gerrit.wikimedia.org/r/420607 [23:20:45] (03CR) 10Ayounsi: [V: 032 C: 032] Add deploy script to link config with where netbox expects it [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/420605 (owner: 10Ayounsi) [23:21:25] (03PS1) 10Dzahn: add design.wikimedia.org, point to misc-web-cache [dns] - 10https://gerrit.wikimedia.org/r/420613 (https://phabricator.wikimedia.org/T185282) [23:22:09] (03CR) 10Dzahn: [C: 032] add design.wikimedia.org, point to misc-web-cache [dns] - 10https://gerrit.wikimedia.org/r/420613 (https://phabricator.wikimedia.org/T185282) (owner: 10Dzahn) [23:25:16] so who is doing the swat? 
[23:27:02] !log ayounsi@tin Started deploy [netbox/deploy@bed8da1]: Fixing netbox deploy issue [23:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:38] !log ayounsi@tin Finished deploy [netbox/deploy@bed8da1]: Fixing netbox deploy issue (duration: 00m 37s) [23:27:40] No one apparently [23:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:48] jouncebot: now [23:27:48] For the next 0 hour(s) and 32 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180319T2300) [23:28:07] tgr: About? [23:28:39] Cupid: https://gerrit.wikimedia.org/r/#/c/267550/ shouldn't be in swat. It's not been reviewed/merged [23:29:26] I thought it got merged during swat? [23:29:27] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4063154 (10Dzahn) 05Open>03Resolved https://design.wikimedia.org/ https://design.wikimedia.org/style-guide/ [23:29:51] Config patches do [23:29:58] Cherry pick to deployment branches of MW do [23:30:45] It's not right anyway [23:32:21] (03PS2) 10Reedy: Start renaming $wmfRealm to $wmgRealm in MWRealm.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417215 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [23:32:43] (03PS3) 10Reedy: Start renaming $wmfRealm to $wmgRealm in MWRealm.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417215 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [23:32:47] (03CR) 10Reedy: [C: 032] Start renaming $wmfRealm to $wmgRealm in MWRealm.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417215 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [23:34:11] (03Merged) 10jenkins-bot: Start renaming $wmfRealm to $wmgRealm in MWRealm.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417215 
(https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [23:35:34] !log ayounsi@tin Started deploy [netbox/deploy@f7faa04]: Fixing netbox deploy issue [23:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:45] !log reedy@tin Synchronized multiversion/MWRealm.php: T45956 (duration: 00m 57s) [23:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:51] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [23:35:52] (03PS2) 10Reedy: Log ReadingLists warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420155 (https://phabricator.wikimedia.org/T189340) (owner: 10Gergő Tisza) [23:35:56] (03CR) 10Reedy: [C: 032] Log ReadingLists warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420155 (https://phabricator.wikimedia.org/T189340) (owner: 10Gergő Tisza) [23:36:12] !log ayounsi@tin Finished deploy [netbox/deploy@f7faa04]: Fixing netbox deploy issue (duration: 00m 38s) [23:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:36] (03PS2) 10Dzahn: DNS: Add production DNS entries for mw2259-mw2290 [dns] - 10https://gerrit.wikimedia.org/r/420425 (owner: 10Papaul) [23:38:07] (03Merged) 10jenkins-bot: Log ReadingLists warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420155 (https://phabricator.wikimedia.org/T189340) (owner: 10Gergő Tisza) [23:39:07] Reedy: sorry, looks my IRC notification stack got broken [23:39:13] (03CR) 10jenkins-bot: Start renaming $wmfRealm to $wmgRealm in MWRealm.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417215 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [23:39:28] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Log ReadingLists warning (duration: 00m 58s) [23:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:43] 10Operations, 10ops-eqiad: WMF4727 hardware issue - disks dont detect in 
installer - https://phabricator.wikimedia.org/T189804#4053963 (10Dzahn) [23:40:44] (03PS5) 10Reedy: Enable Wikidata description override on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418843 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [23:40:48] (03CR) 10Reedy: [C: 032] Enable Wikidata description override on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418843 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [23:41:07] 10Operations, 10ops-eqiad: WMF4727 hardware issue - disks dont detect in installer - https://phabricator.wikimedia.org/T189804#4053963 (10Dzahn) a:05Dzahn>03RobH Please also see the duplicate task i first created and then merged in (T190093) [23:42:03] (03Merged) 10jenkins-bot: Enable Wikidata description override on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418843 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [23:43:38] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Enable Wikidata description override on testwiki (duration: 00m 58s) [23:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:12] (03CR) 10Dzahn: [C: 032] DNS: Add production DNS entries for mw2259-mw2290 [dns] - 10https://gerrit.wikimedia.org/r/420425 (owner: 10Papaul) [23:45:22] (03PS2) 10Reedy: Allow protocol-relative URLs in TemplateStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420115 (https://phabricator.wikimedia.org/T188760) (owner: 10Gergő Tisza) [23:45:28] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (10Volker_E) Thanks @Dzahn for the quick and thorough work on this! 
[23:48:52] (03CR) 10Reedy: "Not deployed in SWAT as no tgr to test it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420115 (https://phabricator.wikimedia.org/T188760) (owner: 10Gergő Tisza) [23:50:36] Reedy: can I test it now? [23:50:50] heh, thought you weren't about as you didn't respond when pinged :P [23:50:53] Can if you want [23:51:03] (03CR) 10Reedy: [C: 032] Allow protocol-relative URLs in TemplateStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420115 (https://phabricator.wikimedia.org/T188760) (owner: 10Gergő Tisza) [23:51:15] yeah, my ping notification breaks every once in a while [23:51:45] heh [23:51:57] I just did the other two because one isn't testable, and the other is just testwiki ;) [23:52:29] 10Operations, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4063260 (10RobH) a:05Cmjohnson>03None [23:52:31] (03Merged) 10jenkins-bot: Allow protocol-relative URLs in TemplateStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420115 (https://phabricator.wikimedia.org/T188760) (owner: 10Gergő Tisza) [23:53:33] uh, apparently this isn't testable either [23:54:04] I thought I'd get warnings on preview if the URL filter does not match, but apparently it only happens on save [23:54:59] I could try to save it via mwdebug, but that seems a bit disruptive [23:55:14] 10Operations, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4063265 (10Dzahn) a:03Dzahn [23:56:42] tgr: it's not testable? [23:56:48] I hadn't pulled it onto mwdebug1001 yet... 
[23:57:24] well, not unless saving something via mwdebug that will be invalid content for all other hosts is a good idea [23:58:29] 10Operations, 10Performance-Team, 10Wikimedia-Apache-configuration: VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4063273 (10Krinkle) [23:58:34] I'll just sync it then [23:59:12] yes, please