[00:00:00] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:04:03] (03CR) 10Chad: "Couple of minor inlines, but otherwise lgtm" (033 comments) [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414598 (owner: 10Paladox) [00:05:23] (03PS5) 10Paladox: Add BUILD files to build plugin [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414598 [00:05:25] (03CR) 10Paladox: Add BUILD files to build plugin (033 comments) [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414598 (owner: 10Paladox) [00:11:13] (03CR) 10Chad: [V: 032 C: 032] Add BUILD files to build plugin [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414598 (owner: 10Paladox) [00:19:26] (03PS1) 10Chad: Add a few more things to gitignore [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414599 [00:30:15] (03PS2) 10Chad: Add a few more things to gitignore [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414599 [00:30:39] (03CR) 10Paladox: [V: 032 C: 032] Add a few more things to gitignore [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414599 (owner: 10Chad) [00:57:43] (03PS1) 10Chad: Adding symlink to about.md for README.md [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414604 [00:58:18] (03PS2) 10Paladox: Adding symlink to about.md for README.md [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414604 (owner: 10Chad) [00:58:22] (03CR) 10Paladox: [V: 032 C: 032] Adding symlink to about.md for README.md [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414604 (owner: 10Chad) [01:06:10] (03PS1) 10Chad: Basic bootstrapping for Github project creation listener [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414605 [01:22:16] (03CR) 10Paladox: Basic bootstrapping for Github project creation listener (031 
comment) [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414605 (owner: 10Chad) [01:25:26] (03CR) 10Chad: Basic bootstrapping for Github project creation listener (031 comment) [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414605 (owner: 10Chad) [01:31:16] (03PS2) 10Chad: Basic bootstrapping for Github project creation listener [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414605 [01:31:54] (03CR) 10Paladox: [V: 032 C: 032] Basic bootstrapping for Github project creation listener [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414605 (owner: 10Chad) [02:15:42] !log disabling ALGs on MR routers [02:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:55] (03PS1) 10Chad: WIP: Adding a "Deployed to" bit for the "Included In" header [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414607 [02:30:30] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [02:31:30] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [02:43:50] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.22) (duration: 07m 12s) [02:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:49] !log labs->cloud vlan rename in codfw - T187933 [02:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:56:04] T187933: Labs to Cloud renaming for networking equipment - https://phabricator.wikimedia.org/T187933 [03:01:11] PROBLEM - puppet last run on restbase1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:15:50] (03CR) 10Krinkle: "Needs careful testing by someone who knows how to test these endpoints on mwdebug (or beta). I personally don't know." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/414310 (owner: 10Umherirrender) [03:26:51] RECOVERY - Check systemd state on rhenium is OK: OK - running: The system is fully operational [03:29:51] PROBLEM - Check systemd state on rhenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:31:11] RECOVERY - puppet last run on restbase1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [04:27:11] RECOVERY - Check systemd state on rhenium is OK: OK - running: The system is fully operational [04:30:11] PROBLEM - Check systemd state on rhenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:09:01] (03PS1) 10Legoktm: ExtensionDistributor: Ignore empty repositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414612 [05:26:40] RECOVERY - Check systemd state on rhenium is OK: OK - running: The system is fully operational [05:29:40] PROBLEM - Check systemd state on rhenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:15:11] !log Stop MySQL on db1115 tendril database to copy it to db2093. 
Tendril (dbtree) service will be down for maintenance - T184704 [06:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:26] T184704: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704 [06:29:51] (03PS1) 10Marostegui: db-codfw.php: Depool db2070 and db2055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414614 [06:31:39] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2070 and db2055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414614 (owner: 10Marostegui) [06:33:07] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2070 and db2055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414614 (owner: 10Marostegui) [06:33:21] (03CR) 10jenkins-bot: db-codfw.php: Depool db2070 and db2055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414614 (owner: 10Marostegui) [06:35:07] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2055 and db2070 (duration: 01m 07s) [06:35:15] !log Stop MySQL db2070 and db2055 to copy data to db2055 (and upgrade kernel and mariadb) [06:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:20] (03PS1) 10Marostegui: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414617 (https://phabricator.wikimedia.org/T187089) [06:50:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414617 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:52:22] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414617 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:52:43] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414617 
(https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:53:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1103:3312 (duration: 00m 56s) [06:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:14] (03PS1) 10Marostegui: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414618 [06:55:28] (03CR) 10Elukey: "> I think it would be nicer if you do this first: https://gerrit.wikimedia.org/r/413889" [puppet] - 10https://gerrit.wikimedia.org/r/413685 (https://phabricator.wikimedia.org/T187805) (owner: 10Elukey) [06:56:46] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414618 (owner: 10Marostegui) [06:58:09] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414618 (owner: 10Marostegui) [06:59:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1103:3314 (duration: 00m 54s) [06:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:26] !log Stop MySQL on db1103:3312 and 3314 to upgrade it and kernel [06:59:26] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414618 (owner: 10Marostegui) [06:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:04] !log Deploy schema change on db1103:3312 - T187089 T185128 T153182 [07:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:18] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [07:08:18] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [07:08:19] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 
[07:11:21] (03PS10) 10Elukey: Introduce role::kafka::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/413728 (https://phabricator.wikimedia.org/T187805) [07:14:03] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414619 [07:17:54] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414619 (owner: 10Marostegui) [07:19:19] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414619 (owner: 10Marostegui) [07:19:30] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414619 (owner: 10Marostegui) [07:20:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1103:3314 after mariadb and kernel upgrade (duration: 00m 56s) [07:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:14] (03CR) 10Elukey: "pcc after rebase: https://puppet-compiler.wmflabs.org/compiler02/10135/" [puppet] - 10https://gerrit.wikimedia.org/r/413728 (https://phabricator.wikimedia.org/T187805) (owner: 10Elukey) [07:32:34] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414621 [07:34:07] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414621 (owner: 10Marostegui) [07:34:17] (03PS2) 10Elukey: role::configcluster: update zookeeper's ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/413685 (https://phabricator.wikimedia.org/T187805) [07:34:23] (03CR) 10Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler02/10137/" [puppet] - 10https://gerrit.wikimedia.org/r/413685 (https://phabricator.wikimedia.org/T187805) (owner: 10Elukey) [07:35:31] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic db1103:3314 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/414621 (owner: 10Marostegui) [07:36:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic db1103:3314 (duration: 00m 56s) [07:36:51] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414621 (owner: 10Marostegui) [07:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:47] (03CR) 10Elukey: [C: 032] role::configcluster: update zookeeper's ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/413685 (https://phabricator.wikimedia.org/T187805) (owner: 10Elukey) [07:45:44] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414622 [07:48:30] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414622 (owner: 10Marostegui) [07:49:55] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414622 (owner: 10Marostegui) [07:50:07] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414622 (owner: 10Marostegui) [07:51:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic db1103:3314 (duration: 00m 56s) [07:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:37] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414623 [08:05:30] (03CR) 10Elukey: [C: 032] Introduce role::kafka::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/413728 (https://phabricator.wikimedia.org/T187805) (owner: 10Elukey) [08:05:35] (03PS11) 10Elukey: Introduce role::kafka::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/413728 (https://phabricator.wikimedia.org/T187805) [08:09:04] (03CR) 10Marostegui: 
[C: 032] db-eqiad.php: Fully repool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414623 (owner: 10Marostegui) [08:10:30] PROBLEM - puppet last run on kafkamon1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[burrow] [08:10:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1103:3314 (duration: 00m 56s) [08:10:33] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414623 (owner: 10Marostegui) [08:10:41] PROBLEM - puppet last run on kafkamon2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[burrow] [08:10:43] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414623 (owner: 10Marostegui) [08:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:15] failures on kafkamon are mine, burrow is not on the stretch apt repo [08:11:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1103:3314 (duration: 00m 56s) [08:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:22] (03CR) 10Muehlenhoff: [C: 032] Fix verbose logging in debdeploy-deploy [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/413758 (owner: 10Muehlenhoff) [08:35:41] (03PS3) 10Jcrespo: tendril: Add memcache to tendril web frontend [puppet] - 10https://gerrit.wikimedia.org/r/414502 (https://phabricator.wikimedia.org/T133906) [08:38:25] 10Operations: TransparencyReport-private is not auto deploying - https://phabricator.wikimedia.org/T188224#4000455 (10Peachey88) >>! In T188224#4000339, @Prtksxna wrote: > Also, would it be possible to move https://transparency.wikimedia.org/private to https://private.transparency.wikimedia.org/. Happy to raise... 
[08:43:45] (03CR) 10Jcrespo: [C: 032] tendril: Add memcache to tendril web frontend [puppet] - 10https://gerrit.wikimedia.org/r/414502 (https://phabricator.wikimedia.org/T133906) (owner: 10Jcrespo) [08:45:38] (03PS4) 10Gehel: Increas bulk insert threadpool for relforge [puppet] - 10https://gerrit.wikimedia.org/r/413810 (owner: 10EBernhardson) [08:46:45] (03CR) 10Gehel: [C: 032] "LGTM (and puppet compiler agrees: https://puppet-compiler.wmflabs.org/compiler02/10138/)" [puppet] - 10https://gerrit.wikimedia.org/r/413810 (owner: 10EBernhardson) [08:48:40] PROBLEM - Host rutherfordium is DOWN: PING CRITICAL - Packet loss = 100% [08:48:40] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100% [08:48:41] PROBLEM - Host dubnium is DOWN: PING CRITICAL - Packet loss = 64%, RTA = 11892.24 ms [08:48:41] PROBLEM - Host chlorine is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2930.83 ms [08:48:41] PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 4503.55 ms [08:48:41] PROBLEM - Host install1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:48:41] PROBLEM - Host logstash1007 is DOWN: PING CRITICAL - Packet loss = 100% [08:48:55] i guess ganeti1006 down :) [08:49:00] PROBLEM - Host hassium is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 4426.14 ms [08:49:03] PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 8%, RTA = 4510.35 ms [08:49:10] RECOVERY - Host chlorine is UP: PING WARNING - Packet loss = 0%, RTA = 1891.89 ms [08:49:11] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2671.33 ms [08:49:20] PROBLEM - Host netmon1003 is DOWN: PING CRITICAL - Packet loss = 54%, RTA = 7861.66 ms [08:49:30] RECOVERY - Host hassium is UP: PING WARNING - Packet loss = 73%, RTA = 1379.77 ms [08:50:41] PROBLEM - SSH on ganeti1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:50:50] PROBLEM - Host chlorine is DOWN: PING CRITICAL - Packet loss = 100% [08:51:09] (03PS3) 10Gehel: Resize the Cirrus LTR 
model cache [puppet] - 10https://gerrit.wikimedia.org/r/413407 (https://phabricator.wikimedia.org/T188015) (owner: 10EBernhardson) [08:51:31] PROBLEM - Host hassium is DOWN: PING CRITICAL - Packet loss = 100% [08:51:51] PROBLEM - SSH on releases1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:51:51] PROBLEM - HTTP on releases1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:54:48] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2070 and db2055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414627 [08:54:50] RECOVERY - SSH on ganeti1006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [08:54:51] RECOVERY - Host webperf1001 is UP: PING WARNING - Packet loss = 80%, RTA = 180.23 ms [08:55:00] RECOVERY - Host dubnium is UP: PING OK - Packet loss = 0%, RTA = 2.83 ms [08:55:00] RECOVERY - Host hassium is UP: PING OK - Packet loss = 0%, RTA = 2.96 ms [08:55:00] RECOVERY - Host netmon1003 is UP: PING OK - Packet loss = 0%, RTA = 2.81 ms [08:55:00] RECOVERY - Host rutherfordium is UP: PING OK - Packet loss = 0%, RTA = 2.34 ms [08:55:00] RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 2.48 ms [08:55:10] RECOVERY - Host chlorine is UP: PING OK - Packet loss = 0%, RTA = 2.55 ms [08:55:10] RECOVERY - Host install1002 is UP: PING OK - Packet loss = 0%, RTA = 2.80 ms [08:55:11] PROBLEM - Check systemd state on ganeti1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:55:40] RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [08:55:50] RECOVERY - SSH on releases1001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) [08:55:50] RECOVERY - Host logstash1007 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [08:55:51] RECOVERY - HTTP on releases1001 is OK: HTTP OK: HTTP/1.1 200 OK - 15234 bytes in 2.091 second response time [08:56:00] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [08:56:22] (03CR) 10Filippo Giunchedi: "Thanks Paladox! Did you test this in labs? I'd like to run some tests myself as well." [puppet] - 10https://gerrit.wikimedia.org/r/391336 (https://phabricator.wikimedia.org/T184562) (owner: 10Paladox) [09:03:32] (03CR) 10Gehel: [C: 032] "LGTM, puppet compiler agrees: https://puppet-compiler.wmflabs.org/compiler02/10139/" [puppet] - 10https://gerrit.wikimedia.org/r/413407 (https://phabricator.wikimedia.org/T188015) (owner: 10EBernhardson) [09:06:47] <_joe_> anyoine doing something about ganeti1006? [09:07:11] <_joe_> it went down twice [09:07:33] (03CR) 10Ema: "Please use self.report to log the main events (check is up, a icmp destination unreachable has been received). 
See how the idleconnection " [debs/pybal] - 10https://gerrit.wikimedia.org/r/413211 (https://phabricator.wikimedia.org/T178151) (owner: 10Vgutierrez) [09:07:35] I tried to join the mgmt console but then it was up and showing recoveries [09:07:44] didn't check logs though [09:13:20] RECOVERY - Check systemd state on ganeti1006 is OK: OK - running: The system is fully operational [09:15:10] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 14273695 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:15:11] PROBLEM - etcd request latencies on chlorine is CRITICAL: CRITICAL - etcd_request_latencies is 14240109 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:17:10] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 2055 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:17:11] RECOVERY - etcd request latencies on chlorine is OK: OK - etcd_request_latencies is 1511 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:23:54] !log copied burrow 0.1 from jessie-wikimedia to stretch-wikimedia [09:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:31] RECOVERY - puppet last run on kafkamon1001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [09:26:24] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2070 and db2055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414627 (owner: 10Marostegui) [09:26:40] (03Abandoned) 10Gehel: T136696 Including a .policy file to grant permission to send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/295129 (owner: 10Nicko) [09:28:10] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2070 and db2055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414627 (owner: 10Marostegui) [09:28:12] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2070 and db2055" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/414627 (owner: 10Marostegui) [09:29:21] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2055 and db2070 (duration: 00m 55s) [09:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:41] RECOVERY - puppet last run on kafkamon2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:31:43] (03PS1) 10Gilles: Add Thumbor private container user configuration keys [puppet] - 10https://gerrit.wikimedia.org/r/414631 (https://phabricator.wikimedia.org/T187822) [09:32:46] (03PS1) 10Elukey: prometheus::ops|analytics: update Kafka Burrow's exporter config [puppet] - 10https://gerrit.wikimedia.org/r/414632 (https://phabricator.wikimedia.org/T180442) [09:34:00] (03PS5) 10Gehel: maps: Icinga alert when OSM replication lags [puppet] - 10https://gerrit.wikimedia.org/r/410172 (https://phabricator.wikimedia.org/T167549) [09:34:21] (03CR) 10Elukey: [C: 032] prometheus::ops|analytics: update Kafka Burrow's exporter config [puppet] - 10https://gerrit.wikimedia.org/r/414632 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [09:36:31] PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:37:31] RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [09:38:03] (03PS6) 10Gehel: maps: Icinga alert when OSM replication lags [puppet] - 10https://gerrit.wikimedia.org/r/410172 (https://phabricator.wikimedia.org/T167549) [09:38:42] (03PS2) 10Gilles: Add Thumbor private container user configuration keys [puppet] - 10https://gerrit.wikimedia.org/r/414631 (https://phabricator.wikimedia.org/T187822) [09:39:16] (03CR) 10Gehel: [C: 032] "Looks good: https://puppet-compiler.wmflabs.org/compiler02/10140/" [puppet] - 10https://gerrit.wikimedia.org/r/410172 (https://phabricator.wikimedia.org/T167549) (owner: 10Gehel) [09:45:25] 
10Operations, 10Discovery, 10Icinga, 10Maps, and 2 others: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#4000604 (10Gehel) Those alerts are now available on Icinga and passing. I'll keep an eye on them for the next few days to make sure we don't have fals... [09:48:22] (03PS3) 10Gehel: maps: icinga alert if tiles are not being generated [puppet] - 10https://gerrit.wikimedia.org/r/410136 (https://phabricator.wikimedia.org/T175243) [09:49:27] (03PS1) 10Elukey: role::kafka::monitoring: add lag monitoring for Jumbo [puppet] - 10https://gerrit.wikimedia.org/r/414636 (https://phabricator.wikimedia.org/T180442) [09:51:14] (03PS2) 10Elukey: role::kafka::monitoring: add lag monitoring for Jumbo [puppet] - 10https://gerrit.wikimedia.org/r/414636 (https://phabricator.wikimedia.org/T180442) [09:52:01] (03CR) 10Elukey: [C: 032] role::kafka::monitoring: add lag monitoring for Jumbo [puppet] - 10https://gerrit.wikimedia.org/r/414636 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [09:54:52] (03PS1) 10Gehel: Resize the Cirrus LTR model cache [puppet] - 10https://gerrit.wikimedia.org/r/414637 (https://phabricator.wikimedia.org/T188015) [09:56:42] (03CR) 10DCausse: [C: 031] Resize the Cirrus LTR model cache [puppet] - 10https://gerrit.wikimedia.org/r/414637 (https://phabricator.wikimedia.org/T188015) (owner: 10Gehel) [09:56:52] (03CR) 10Gehel: [C: 032] Resize the Cirrus LTR model cache [puppet] - 10https://gerrit.wikimedia.org/r/414637 (https://phabricator.wikimedia.org/T188015) (owner: 10Gehel) [09:57:37] PROBLEM - puppet last run on kafkamon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:58:28] checking --^ [10:02:29] 10Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#4000623 (10aborrero) >>! In T187994#3992756, @aborrero wrote: > [...] 
> Our use cases could benefit from nftables in several aspects: > * performance, by using sets, maps, dicts and concatenations instead of... [10:04:34] (03PS1) 10Gehel: logstash: kafka analytics cluster isn't available from deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/414638 [10:10:06] !log rebooting mw canaries for kernel security update [10:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:42] (03PS1) 10Elukey: role::kafka::monitoring: fix wrong cut in previous change [puppet] - 10https://gerrit.wikimedia.org/r/414639 (https://phabricator.wikimedia.org/T180442) [10:32:15] (03CR) 10Elukey: [C: 032] role::kafka::monitoring: fix wrong cut in previous change [puppet] - 10https://gerrit.wikimedia.org/r/414639 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [10:32:20] (03PS2) 10Elukey: role::kafka::monitoring: fix wrong cut in previous change [puppet] - 10https://gerrit.wikimedia.org/r/414639 (https://phabricator.wikimedia.org/T180442) [10:35:37] (03PS2) 10Arturo Borrero Gonzalez: toollabs: apt_pinning: extend pinnigs for pam libs [puppet] - 10https://gerrit.wikimedia.org/r/413780 (https://phabricator.wikimedia.org/T187193) [10:36:43] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toollabs: apt_pinning: extend pinnigs for pam libs [puppet] - 10https://gerrit.wikimedia.org/r/413780 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [10:37:37] RECOVERY - puppet last run on kafkamon1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:38:19] kart_: talking with zeljkof looks like we could deploy your ULS patch to maintenance/ULSCompactLinksDisablePref.php right now? 
( https://gerrit.wikimedia.org/r/#/c/414609/ ) [10:39:06] ohh no that is more complicated :D [10:39:23] the script is on hold until it is switched to vslow replicas https://gerrit.wikimedia.org/r/#/c/414608/ [10:39:27] which I guess we can also deploy right now [10:40:35] zeljkof: we should get an early morning swat slot just for Kartik :] [10:41:31] hashar, kart_: let's compromise :) and review/merge the patch now (so we don't have to wait for CI during SWAT) but deploy during EU SWAT [10:41:45] if kart_ promises to be around during EU SWAT ;) [10:42:00] they only touch a maintenance script which is run manually [10:42:10] but yeah lets see what kart_ has to say about it :) [10:42:14] I am not worried for sure [10:46:53] (03PS1) 10Marostegui: site.pp: Add a comment about db1113 [puppet] - 10https://gerrit.wikimedia.org/r/414641 (https://phabricator.wikimedia.org/T184704) [10:48:02] (03CR) 10Marostegui: [C: 032] site.pp: Add a comment about db1113 [puppet] - 10https://gerrit.wikimedia.org/r/414641 (https://phabricator.wikimedia.org/T184704) (owner: 10Marostegui) [10:49:49] hashar: zeljkof basically, script won't be run today. It is scheduled to run on Wednesday. [10:50:25] hashar: those two patches can be deploy in SWAT or before. I'm fine. Just added as per normal routine procedure. [10:51:01] hashar: I'll take most of free slot of Wednesday to run the script actully.. [10:52:04] kart_: can both of your patches be merged before SWAT, and deployed together? [10:52:20] (deployed during SWAT) [10:52:31] zeljkof: yes. doable. [10:52:40] or should I merge and deploy patches one by one? [10:53:05] just curious, to make the swat quicker [10:53:19] zeljkof: merge both and deploy. Less chance. No need to seprately deploy. [10:53:33] zeljkof: no testing needed, except checking both changes are in wmf.22. [10:54:24] kart_: cool, I'll merge them now and deploy during SWAT [10:54:31] OK! 
[10:58:21] (03CR) 10Alexandros Kosiaris: [C: 04-2] "I don't think this are for any reason special hosts. They are ordinary hosts (alongside many others in that stanza that are wrongfully pla" [puppet] - 10https://gerrit.wikimedia.org/r/413889 (https://phabricator.wikimedia.org/T187805) (owner: 10Dzahn) [11:00:05] jan_drewniak: #bothumor I ♥ Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180226T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:01:51] !log powercycling mw1264 (stuck after reboot) [11:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:10] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414643 (https://phabricator.wikimedia.org/T128546) [11:04:21] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414643 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:05:48] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414643 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:06:56] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414643 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:11:08] 10Operations, 10Puppet, 10Patch-For-Review: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4000729 (10fgiunchedi) I am trying at least to get `role::puppetmaster::standalone` going on stretch, so far not a whole lot of luck, namely the server 500s when cont... 
[11:11:23] !log jdrewniak@tin Synchronized portals/prod/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:402805|Bumping portals to master (T128546)]] (duration: 00m 58s) [11:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:38] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:12:21] !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:402805|Bumping portals to master (T128546)]] (duration: 00m 57s) [11:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:47] (03CR) 10Lucas Werkmeister (WMDE): "wmf.22 is deployed now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413724 (https://phabricator.wikimedia.org/T184812) (owner: 10Lucas Werkmeister (WMDE)) [11:15:51] 10Operations, 10HHVM, 10MW-1.31-release-notes (WMF-deploy-2018-02-13 (1.31.0-wmf.21)), 10Performance-Team (Radar): HHVM hangs on the API cluster - https://phabricator.wikimedia.org/T184048#4000758 (10Joe) 05Open>03Resolved a:03Joe [11:19:26] 10Operations, 10Continuous-Integration-Infrastructure, 10MediaWiki-Core-Tests, 10HHVM: Readd complete URL parsing fix from 3.18.7 release - https://phabricator.wikimedia.org/T185024#4000766 (10MoritzMuehlenhoff) p:05Unbreak!>03Normal [11:20:10] 10Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#4000768 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:21:53] 10Operations, 10media-storage: Have swift metrics available in Prometheus - https://phabricator.wikimedia.org/T187991#4000771 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:24:24] 10Operations: Define a special range in constants.pp for the LVS hosts - https://phabricator.wikimedia.org/T187910#3989817 (10MoritzMuehlenhoff) @Andrew : There is $CACHE_MISC already as a network constant / ferm macro. 
[11:25:06] 10Operations, 10ops-codfw: db2049 management unable to login via ssh - https://phabricator.wikimedia.org/T187534#4000781 (10MoritzMuehlenhoff) p:05Triage>03Normal a:03Papaul [11:25:27] 10Operations, 10ops-eqiad, 10DBA: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#4000783 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:27:55] 10Operations, 10Puppet, 10Patch-For-Review: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4000789 (10fgiunchedi) After manually running `puppet master --debug --no-daemonize --masterport 8142` and then interrupting it, apparently now also phusion is able t... [11:41:08] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:42:37] <_joe_> whoa what a peak [11:42:46] <_joe_> it's already gone, but still [11:42:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:43:11] <_joe_> someone should take a look, this looks genuinely like a small outage [11:49:17] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:49:47] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [12:06:39] (03PS2) 10Alexandros Kosiaris: apache: Support IPv6 in status [puppet] - 10https://gerrit.wikimedia.org/r/411193 [12:06:42] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] apache: Support IPv6 in status [puppet] - 10https://gerrit.wikimedia.org/r/411193 (owner: 10Alexandros Kosiaris) [12:09:47] (03PS1) 10Arturo Borrero Gonzalez: apt: apt_upgrade: include link to wikitech docs [puppet] - 10https://gerrit.wikimedia.org/r/414649 (https://phabricator.wikimedia.org/T181647) [12:10:48] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: apt_upgrade: include link to wikitech docs [puppet] - 10https://gerrit.wikimedia.org/r/414649 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [12:10:58] (03PS1) 10Ladsgroup: Add patrol rights/groups to fawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414650 (https://phabricator.wikimedia.org/T187662) [12:11:42] (03CR) 10Paladox: "> Thanks Paladox! Did you test this in labs? I'd like to run some" [puppet] - 10https://gerrit.wikimedia.org/r/391336 (https://phabricator.wikimedia.org/T184562) (owner: 10Paladox) [12:25:59] 10Operations, 10ops-codfw, 10DBA: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#4000988 (10jcrespo) 05Resolved>03Open This failed again, I guess because using a bad disk: Predictive Failure: 1I:1:1 [12:26:31] 10Operations, 10ops-codfw, 10DBA: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#4000990 (10Marostegui) a:05Marostegui>03Papaul [12:28:10] (03CR) 10Sau226: "I've filed a request on the page the admin linked in the phab task and will add relevant info to the task when consensus is acquired." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/414509 (https://phabricator.wikimedia.org/T184959) (owner: 10Sau226) [12:30:59] 10Operations: TransparencyReport-private is not auto deploying - https://phabricator.wikimedia.org/T188224#4001006 (10akosiaris) p:05Triage>03Normal It's updating fine from what I see. Both https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia/TransparencyReport-private and the repo on the server are at... [12:33:48] 10Operations, 10Discovery, 10Traffic, 10WMDE-Analytics-Engineering, and 3 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875#4001024 (10Addshore) Bump as this is probably trivial but needs the right pair of hands to get it done. [12:58:42] (03PS1) 10Ladsgroup: Enable statement usage tracking in several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414654 (https://phabricator.wikimedia.org/T151717) [13:09:34] !log rebooting video scalers in eqiad for kernel security update [13:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:46] (03PS1) 10Arturo Borrero Gonzalez: toollabs: tools-clush-generator: introduce clush group 'one_of_each' [puppet] - 10https://gerrit.wikimedia.org/r/414657 (https://phabricator.wikimedia.org/T181647) [13:26:30] (03PS2) 10Arturo Borrero Gonzalez: toollabs: tools-clush-generator: introduce clush group 'one_of_each' [puppet] - 10https://gerrit.wikimedia.org/r/414657 (https://phabricator.wikimedia.org/T181647) [13:29:30] (03CR) 10Phuedx: [C: 031] "Given the comment in the header of pp_stage1_raw.dblist, I think that the pp_*.dblist can be deleted safely. 
If you've queued this for dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413978 (owner: 10Krinkle) [13:34:56] (03PS7) 10Lokal Profil: Drop the medlem user group and editallpages user right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) [13:37:55] (03CR) 10Elukey: "Me and Gehel had a chat over IRC, we have a Kafka cluster in deployment prep but its name is not analytics, but 'jumbo-deployment-prep'. T" [puppet] - 10https://gerrit.wikimedia.org/r/414638 (owner: 10Gehel) [13:39:29] 10Operations, 10Cloud-VPS, 10cloud-services-team, 10hardware-requests: eqiad: (2) systems for labstore expansion (labstore1008 & labstore1009) - https://phabricator.wikimedia.org/T186931#4001142 (10chasemp) @robh poke [13:40:36] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4001145 (10chasemp) @robh or @Cmjohnson any luck figuring out what the NIC situation is here? We will have to figure out something fairly soon if we need to order different NICs. [13:59:11] 10Operations, 10Puppet, 10Patch-For-Review: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4001197 (10fgiunchedi) a:03fgiunchedi [13:59:26] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#3888192 (10fgiunchedi) [14:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180226T1400). [14:00:05] Jhs, kart_, Lucas_WMDE, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:13] o/ [14:00:16] I can SWAT today [14:00:21] I also have another one coming [14:00:55] * kart_ is here [14:00:58] kart_: I will deploy your commits first, since they are already merged, is there anything to test? [14:01:06] or should I just deploy? [14:01:07] zeljkof: nope. [14:01:19] Just deploy. I'll verify quickly in branch. [14:01:21] kart_: ok, I'll let you know once it's deployed [14:01:33] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 2 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#4001208 (10MoritzMuehlenhoff) >>! In T182832#3982284, @elu... [14:02:28] * Lucas_WMDE is here [14:03:31] (03PS1) 10Awight: Restore ORES celery worker count; kill defaults [puppet] - 10https://gerrit.wikimedia.org/r/414666 [14:03:40] (03PS1) 10Ladsgroup: Enable reading full entity id from wb_terms table in three wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414667 (https://phabricator.wikimedia.org/T114903) [14:04:03] The third one: https://gerrit.wikimedia.org/r/414667 [14:04:11] (03CR) 10jerkins-bot: [V: 04-1] Restore ORES celery worker count; kill defaults [puppet] - 10https://gerrit.wikimedia.org/r/414666 (owner: 10Awight) [14:04:42] 10Operations, 10Ops-Access-Requests: Add Ian Marlier to udp2log-users group - https://phabricator.wikimedia.org/T188042#4001214 (10MoritzMuehlenhoff) p:05Triage>03Normal a:03MoritzMuehlenhoff [14:06:32] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 792.21 seconds [14:07:20] chcking [14:08:45] (03PS1) 10Muehlenhoff: Add imarlier to udp2log-users [puppet] - 10https://gerrit.wikimedia.org/r/414668 (https://phabricator.wikimedia.org/T188042) [14:10:38] !log zfilipin@tin Synchronized 
php-1.31.0-wmf.22/extensions/UniversalLanguageSelector/maintenance/ULSCompactLinksDisablePref.php: SWAT: [[gerrit:414609|Added option to continue script from particular User ID]] [[gerrit:414608|Use a replica dedicated to slow queries (if available) (T187880)]] (duration: 00m 58s) [14:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:54] T187880: Improve preference migration script - https://phabricator.wikimedia.org/T187880 [14:11:18] kart_: deployed, please check and thanks for deploying with #releng ;) [14:11:23] Okay [14:11:26] (03PS2) 10Awight: Restore ORES celery worker count; kill defaults [puppet] - 10https://gerrit.wikimedia.org/r/414666 [14:11:28] Amir1: do you want to deploy next? [14:11:48] (I need some time to review other patches) [14:12:18] zeljkof: at some sort of meeting atm :/ [14:12:32] Amir1: should I deploy your patches? [14:12:40] or will you do it later in the swat window? [14:12:40] zeljkof: it would be great [14:12:44] Amir1: sure, will do [14:13:05] Jhs, Lucas_WMDE: do you want to deploy your patches, if you can? [14:13:40] zeljkof: I don’t have deploy rights, so I’ll have to depend on the lovely #releng folks to help me :) [14:13:53] (03PS1) 10Giuseppe Lavagetto: Add the --hostname switch to simple node actions. 
[software/conftool] - 10https://gerrit.wikimedia.org/r/414669 [14:13:55] (03PS1) 10Giuseppe Lavagetto: Make full path of the object seen in the output for any change in SetAction and EditAction [software/conftool] - 10https://gerrit.wikimedia.org/r/414670 [14:13:56] but I can test on the debug servers [14:14:12] Lucas_WMDE: will do :) as far as I rememer, Jhs also can not deploy [14:14:45] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#4001238 (10hashar) So yes: lets decommission labnodepool1002.eqiad.wmnet :] [14:15:23] (03CR) 10jerkins-bot: [V: 04-1] Add the --hostname switch to simple node actions. [software/conftool] - 10https://gerrit.wikimedia.org/r/414669 (owner: 10Giuseppe Lavagetto) [14:15:28] (03CR) 10jerkins-bot: [V: 04-1] Make full path of the object seen in the output for any change in SetAction and EditAction [software/conftool] - 10https://gerrit.wikimedia.org/r/414670 (owner: 10Giuseppe Lavagetto) [14:15:37] !log rebooting scb in codfw for kernel security updates [14:15:43] Jhs: around for SWAT? [14:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:22] Lucas_WMDE: you are next, I'll let you know when your patch is at mwdebug1002 [14:16:27] ok thanks [14:16:29] in a few minutes [14:17:27] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413724 (https://phabricator.wikimedia.org/T184812) (owner: 10Lucas Werkmeister (WMDE)) [14:17:34] zeljkof: looks good. Thanks! 
[14:17:46] kart_: /me thumbs up ;) [14:17:49] (sorry, took more minutes than I assumed) [14:18:24] kart_: no problem, it's in a separate place from the other things, so I could do other stuff in parallel :) [14:18:59] (03Merged) 10jenkins-bot: Enable caching of constraint check results [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413724 (https://phabricator.wikimedia.org/T184812) (owner: 10Lucas Werkmeister (WMDE)) [14:19:13] (03CR) 10jenkins-bot: Enable caching of constraint check results [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413724 (https://phabricator.wikimedia.org/T184812) (owner: 10Lucas Werkmeister (WMDE)) [14:19:51] (03PS4) 10Niedzielski: New: add chromium_render service [puppet] - 10https://gerrit.wikimedia.org/r/409996 (https://phabricator.wikimedia.org/T178166) [14:21:21] (03CR) 10Filippo Giunchedi: "> > Thanks Paladox! Did you test this in labs? I'd like to run some" [puppet] - 10https://gerrit.wikimedia.org/r/391336 (https://phabricator.wikimedia.org/T184562) (owner: 10Paladox) [14:21:30] (03CR) 10Filippo Giunchedi: [C: 04-1] puppetmaster: Use ruby-mysql2 over ruby-mysql and migrate servermon to it [puppet] - 10https://gerrit.wikimedia.org/r/391336 (https://phabricator.wikimedia.org/T184562) (owner: 10Paladox) [14:21:30] there is this research query blocking enwiki replication to dbstore1002 [14:22:05] Lucas_WMDE: your patch is at mwdebug1002, please test and let me know if I can deploy it [14:22:13] zeljkof: already testing, thank you :) [14:22:18] looking good so far [14:22:23] (03CR) 10Paladox: "> > > Thanks Paladox! Did you test this in labs? 
I'd like to run some" [puppet] - 10https://gerrit.wikimedia.org/r/391336 (https://phabricator.wikimedia.org/T184562) (owner: 10Paladox) [14:22:48] (03PS1) 10Filippo Giunchedi: puppetmaster: ruby-activerecord-deprecated-finders not in stretch [puppet] - 10https://gerrit.wikimedia.org/r/414674 (https://phabricator.wikimedia.org/T184562) [14:22:51] (03PS1) 10Filippo Giunchedi: WIP ruby-mysql2 [puppet] - 10https://gerrit.wikimedia.org/r/414675 (https://phabricator.wikimedia.org/T184562) [14:24:12] zeljkof, i'm here now, [14:24:15] almost forgot [14:24:23] Jhs: ok, you are next, please stand by :) [14:24:43] you patch will be at mwdebug1002 in 5-10 minutes [14:24:54] I'll let you know when it's there [14:25:06] cool [14:26:33] zeljkof: okay, everything seems to be working so far… [14:26:41] (03CR) 10Volans: [C: 031] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/414670 (owner: 10Giuseppe Lavagetto) [14:26:44] and I think I’m done with testing [14:26:52] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4001287 (10fgiunchedi) So `role::puppetmaster::standalone` with the patches proposed above works on stretch. For production AFAIK it isn't trivia... [14:26:54] Lucas_WMDE: ok to deploy? 
[14:26:54] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: Activate kafka-based recent change poller for wikidata query service - https://phabricator.wikimedia.org/T188252#4001288 (10Gehel) [14:27:02] yes, I think it is [14:27:07] Lucas_WMDE: deploying [14:27:22] I'll be ready for deploy in five minutes or so [14:27:27] the meeting has finished [14:27:45] jynus: I'll look at that ticket and altering the query asap :) [14:28:10] !log zfilipin@tin Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:413724|Enable caching of constraint check results (T184812)]] (duration: 00m 55s) [14:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:24] T184812: Enable constraint result caching on Wikidata - https://phabricator.wikimedia.org/T184812 [14:28:36] Amir1: ok, just to deploy Jhs's patch and the swat is yours :) [14:28:53] Lucas_WMDE: deployed! please test and thank for deploying with #releng ;) [14:29:05] Jhs: merging your patch [14:29:05] zeljkof: thank you, always a pleasure :) [14:29:30] addshore: https://phabricator.wikimedia.org/T175790#4001318 [14:29:47] that will also fix it when the analytics topology is changed [14:30:01] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) (owner: 10Zoranzoki21) [14:30:03] (03PS1) 10BBlack: interface::rps: change IRQ count without reboot [puppet] - 10https://gerrit.wikimedia.org/r/414676 [14:30:06] e.g. 
if you have a single staging db but enwiki is "outside" [14:30:11] cool [14:30:15] let me know when it's done [14:30:32] (03CR) 10jerkins-bot: [V: 04-1] interface::rps: change IRQ count without reboot [puppet] - 10https://gerrit.wikimedia.org/r/414676 (owner: 10BBlack) [14:31:43] (03Merged) 10jenkins-bot: Add namespaces to urwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) (owner: 10Zoranzoki21) [14:31:57] (03CR) 10jenkins-bot: Add namespaces to urwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) (owner: 10Zoranzoki21) [14:32:58] Jhs: your patch is at mwdebug1002, please test and let me know if I can deploy it [14:33:57] zeljkof, looks good as far as I can tell (Y) [14:34:04] remember to run the script :) [14:34:15] Wait [14:34:20] I want to test same [14:34:23] No deploy [14:34:24] I don't think there are any conflicting pages [14:34:43] (03PS2) 10BBlack: interface::rps: change IRQ count without reboot [puppet] - 10https://gerrit.wikimedia.org/r/414676 [14:35:57] Jhs: uh, which script? the patch and the task are both big, can not find it [14:36:21] zeljkof, mwscript namespaceDupes.php urwiktionary --fix [14:36:33] (from memory) [14:36:40] Jhs: help [14:36:42] Jhs: http://prntscr.com/ijyrqw [14:36:46] Jhs: Is it good? [14:36:52] 10Operations, 10hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#4001344 (10Joe) I think reusing the imagescalers (which are quite beefy machines) to this purpose is a good idea. I don't think that the normal load of the videoscalers cluster meri... [14:37:15] Jhs: thanks, will do [14:37:33] Zoranzoki21, looks right. 
click the talk page tab as well and check that there is only one colon : in the title, not two [14:37:42] ok [14:37:44] looks good [14:37:50] zeljkof: I tested more detailed [14:37:57] zeljkof: Lets deploy it [14:38:34] Jhs, Zoranzoki21: ok, deploying [14:40:04] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:407901|Add namespaces to urwiktionary (T186393)]] (duration: 00m 56s) [14:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:19] T186393: Localize namespaces on urwiktionary - https://phabricator.wikimedia.org/T186393 [14:40:19] Jhs, Zoranzoki21: deployed, running the script [14:40:22] (03CR) 10Alexandros Kosiaris: [C: 04-1] "No, not really. There are clearly calls in the code that belong to mysql gem API and are not present in the mysql2 gem API. e.g. things li" [puppet] - 10https://gerrit.wikimedia.org/r/391336 (https://phabricator.wikimedia.org/T184562) (owner: 10Paladox) [14:42:41] Jhs, Zoranzoki21: the script is done https://phabricator.wikimedia.org/T186393#4001362 please check and thanks for deploying with #releng! ;) [14:43:15] thanks zeljkof :) o/ [14:43:19] Amir1: the SWAT is all yours! :) [14:43:27] Thanks! [14:43:38] (03PS2) 10Ladsgroup: Enable statement usage tracking in several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414654 (https://phabricator.wikimedia.org/T151717) [14:43:47] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414654 (https://phabricator.wikimedia.org/T151717) (owner: 10Ladsgroup) [14:43:48] Amir1: don't forget to close the window with !log EU SWAT finished :) [14:43:51] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 287.79 seconds [14:43:57] zeljkof: Sure [14:45:24] Amir1: how many patches do you have? 
[14:45:30] (03Merged) 10jenkins-bot: Enable statement usage tracking in several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414654 (https://phabricator.wikimedia.org/T151717) (owner: 10Ladsgroup) [14:45:32] three [14:45:35] okay [14:45:43] I might add 1 thing to the end of swat [14:47:04] (03CR) 10jenkins-bot: Enable statement usage tracking in several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414654 (https://phabricator.wikimedia.org/T151717) (owner: 10Ladsgroup) [14:47:27] 10Operations, 10hardware-requests: Site: (2) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#4001386 (10faidon) a:03RobH Sounds good. Note that eqiad has 6 imagescalers (mw1293-mw1298) and codfw has 4 now ( mw2244-2245/mw2150-2151) but let's go with reassigning 4+4 for vi... [14:47:52] I put v instead of the Shift + V the log is a little bit weird, sorry [14:48:01] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:V|Enable statement usage tracking in several wikis (T151717)]] (duration: 00m 57s) [14:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:19] T151717: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717 [14:48:20] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414650 (https://phabricator.wikimedia.org/T187662) (owner: 10Ladsgroup) [14:48:27] (03PS2) 10Ladsgroup: Add patrol rights/groups to fawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414650 (https://phabricator.wikimedia.org/T187662) [14:48:38] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414650 (https://phabricator.wikimedia.org/T187662) (owner: 10Ladsgroup) [14:50:10] !log upload puppetdb 4.4.0-1~wmf1 to stretch-wikimedia - T177253 [14:50:10] (03Merged) 10jenkins-bot: Add patrol rights/groups to fawikisource [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/414650 (https://phabricator.wikimedia.org/T187662) (owner: 10Ladsgroup) [14:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:23] T177253: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253 [14:50:28] (03CR) 10jenkins-bot: Add patrol rights/groups to fawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414650 (https://phabricator.wikimedia.org/T187662) (owner: 10Ladsgroup) [14:50:38] 10Operations, 10hardware-requests: eqiad/codfw: (4)+(4) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#4001397 (10faidon) 05Open>03stalled p:05Triage>03Normal [14:51:22] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:44] confirming the patch works fine, moving forward [14:52:58] !log rebooting relforge for kernel upgrade [14:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:12] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.145 second response time [14:53:56] (03PS2) 10Ladsgroup: Enable reading full entity id from wb_terms table in three wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414667 (https://phabricator.wikimedia.org/T114903) [14:54:25] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414667 (https://phabricator.wikimedia.org/T114903) (owner: 10Ladsgroup) [14:54:38] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:414650|Add patrol rights/groups to fawikisource (T187662)]] (duration: 00m 56s) [14:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:51] T187662: Add autopatrol and related rights to fawikisource - https://phabricator.wikimedia.org/T187662 [14:56:31] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:51] (03PS14) 10Elukey: [WIP] eventlogging: 
add systemd support [puppet] - 10https://gerrit.wikimedia.org/r/413362 [15:00:36] (03PS3) 10BBlack: interface::rps: change IRQ count without reboot [puppet] - 10https://gerrit.wikimedia.org/r/414676 [15:01:03] (03Merged) 10jenkins-bot: Enable reading full entity id from wb_terms table in three wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414667 (https://phabricator.wikimedia.org/T114903) (owner: 10Ladsgroup) [15:01:16] (03CR) 10jenkins-bot: Enable reading full entity id from wb_terms table in three wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414667 (https://phabricator.wikimedia.org/T114903) (owner: 10Ladsgroup) [15:04:58] Amir1: are you all done? [15:05:10] not yet [15:05:12] okay [15:05:16] ping me when you are :) [15:06:56] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085#4001464 (10Vgutierrez) On pybal-test2001 the following behaviour can be observed: ``` pybal -d | grep -i bgp pybal -d 2>&1 | grep -i b... 
[15:08:07] (03PS3) 10Zoranzoki21: Add mushroomobserver.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414401 (https://phabricator.wikimedia.org/T188203) [15:09:20] (03CR) 10jerkins-bot: [V: 04-1] Add mushroomobserver.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414401 (https://phabricator.wikimedia.org/T188203) (owner: 10Zoranzoki21) [15:09:38] works fine on mwdebug1002, moving forward [15:11:02] PROBLEM - Disk space on rhenium is CRITICAL: DISK CRITICAL - free space: / 1763 MB (3% inode=96%) [15:11:49] !log ladsgroup@tin Synchronized wmf-config/Wikibase-production.php: [[gerrit:414667|Enable reading full entity id from wb_terms table in three wikis (T114903)]] (duration: 00m 56s) [15:11:55] !log reboot of relforge completed, cluster is green again [15:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:03] T114903: Migrate wb_terms to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114903 [15:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:38] !log This might have performance implications roll it back if it affects these wikis too much [15:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:00] addshore: the floor is yours, don't forget to log when EU SWAT is finished [15:13:02] RECOVERY - Disk space on rhenium is OK: DISK OK [15:13:17] thanks [15:13:18] will do! 
[15:19:38] !log addshore@tin Started scap: Updated mediawiki/extensions/AdvancedSearch i18n files for some translations [15:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:00] thats the last thing in EU swat [15:21:18] 10Operations: Remove imagescaler cluster (aka 'rendering') - https://phabricator.wikimedia.org/T188062#4001493 (10MoritzMuehlenhoff) p:05Triage>03Normal a:03MoritzMuehlenhoff [15:27:44] akosiaris hi, how would i replace fetch_row please? I've been looking but the only thing i've come accross is [15:27:45] https://stackoverflow.com/questions/14064649/ruby-mysql-fetching-single-row-but-still-using-each [15:28:06] also would this if rs.num_rows.zero? become if rs.count > 0 ? [15:30:30] paladox: I honestly don't know. I 'll have to study the mysql2 API (which I haven't had found the time do yet) [15:31:08] !log addshore@tin Finished scap: Updated mediawiki/extensions/AdvancedSearch i18n files for some translations (duration: 11m 29s) [15:31:18] paladox: a quick look at e.g. https://github.com/brianmario/mysql2/search?utf8=%E2%9C%93&q=num_rows&type= is that tipped me off that the API is different [15:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:37] but that's about the depth at which I have went to up to now [15:32:09] paladox: btw I took a stab at https://gerrit.wikimedia.org/r/#/c/414675/ but again it'll need testing [15:32:13] also WIP [15:32:19] !log EU SWAT done [15:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:35] godog thanks. 
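Editorial note on the two questions above (how to replace fetch_row, and what `rs.num_rows.zero?` becomes): a minimal sketch, not the servermon code. It assumes only that the new result object is Enumerable, which is how the mysql2 gem exposes result sets. `FakeResult`, `first_row` and `empty_result?` are hypothetical names used so the example runs without a database; real rows and column names will differ.

```ruby
# Hypothetical stand-in for a mysql2 result set, for illustration only.
# The assumption is just that the real result object is Enumerable.
class FakeResult
  include Enumerable

  def initialize(rows)
    @rows = rows
  end

  # Enumerable derives first, count, map, etc. from each.
  def each(&block)
    @rows.each(&block)
  end
end

# Old gem style:      row = rs.fetch_row
# Enumerable style:   first row, or nil when the result set is empty.
def first_row(result)
  result.first
end

# Old gem style:      rs.num_rows.zero?
# Enumerable style:   true when the result set has no rows.
def empty_result?(result)
  result.count.zero?
end

hosts = FakeResult.new([{ "id" => 1, "name" => "example-host" }])
none  = FakeResult.new([])

puts first_row(hosts).inspect
puts empty_result?(hosts)
puts empty_result?(none)
```

Note that `rs.count > 0` is the opposite test of `rs.num_rows.zero?`; the like-for-like replacement is `rs.count.zero?`, so a direct swap to `> 0` would invert the branch.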
[15:32:44] akosiaris oh [15:33:07] I think that godog is at a good path though [15:33:28] (03CR) 10Paladox: WIP ruby-mysql2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/414675 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [15:33:48] akosiaris yep, just one comment though [15:33:51] PROBLEM - Check systemd state on rhenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:33:54] which i posted :) [15:34:05] akosiaris: so in theory to test it on a puppetmaster what's needed is reports = servermon and the db* config values (?) [15:34:20] godog: yes [15:34:31] PROBLEM - puppet last run on actinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:34:47] what I always end up doing is having a puppetmaster set with very low load and live testing it in production :P [15:36:14] hahah live_testing_in_production.jpg [15:41:24] !log marking wikitech read-only (via a local edit to CommonSettings.php) for https://phabricator.wikimedia.org/T188029 [15:41:26] andrewbogott: Failed to log message to wiki. Somebody should check the error logs. [15:41:41] oh, of course [15:42:07] !log marking wikitech read-only (via a local edit to CommonSettings.php) for https://phabricator.wikimedia.org/T188029 [15:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:12] !log swapping failed disk db1068 [15:43:19] marostegui ^ [15:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:26] cmjohnson1: thanks!! 
[15:45:12] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.359 second response time [15:45:36] !log made wikitech read/write again pending a bit more preliminary work [15:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:19] (03CR) 10Dzahn: "ok, i didn't know where we draw the line between "special" host and "regular" host. I figured these are monitoring hosts and that's why." [puppet] - 10https://gerrit.wikimedia.org/r/413889 (https://phabricator.wikimedia.org/T187805) (owner: 10Dzahn) [15:46:31] marostegui disk in slot 2 is rebuilding. Ping me after and I can swap the other [15:46:35] (03Abandoned) 10Dzahn: network::constants: add kafkamon servers [puppet] - 10https://gerrit.wikimedia.org/r/413889 (https://phabricator.wikimedia.org/T187805) (owner: 10Dzahn) [15:47:26] cmjohnson1: I can see the disk now being rebuilt, thanks. It will probably take a while…probably the other one will be done tomorrow I guess. [15:47:29] (03PS2) 10Muehlenhoff: Switch debdeploy clients to Python 3 (WIP) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/413397 [15:47:46] okay, just let me know or ping me in task [15:48:00] (03PS1) 10Hashar: Tweak gbp to use 'master' has the upstream branch [software/conftool] - 10https://gerrit.wikimedia.org/r/414694 [15:48:08] cmjohnson1: will do - thanks a lot [15:48:10] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T188187#4001627 (10Marostegui) Thanks Chris: ``` root@db1068:~# megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 1% in 1 Minutes. ``` Once this is finish... 
[15:48:31] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:29] (03CR) 10jerkins-bot: [V: 04-1] Tweak gbp to use 'master' has the upstream branch [software/conftool] - 10https://gerrit.wikimedia.org/r/414694 (owner: 10Hashar) [15:50:04] (03PS2) 10Hashar: Tweak gbp to use 'master' has the upstream branch [software/conftool] - 10https://gerrit.wikimedia.org/r/414694 [15:50:06] (03PS1) 10Hashar: Typo in changelog: jesse -> jessie [software/conftool] - 10https://gerrit.wikimedia.org/r/414695 [15:50:36] (03Abandoned) 10Dzahn: introduce role(kafkamon) and make new VMs use it [puppet] - 10https://gerrit.wikimedia.org/r/413672 (https://phabricator.wikimedia.org/T187805) (owner: 10Dzahn) [15:51:32] (03CR) 10jerkins-bot: [V: 04-1] Typo in changelog: jesse -> jessie [software/conftool] - 10https://gerrit.wikimedia.org/r/414695 (owner: 10Hashar) [15:51:37] (03CR) 10jerkins-bot: [V: 04-1] Tweak gbp to use 'master' has the upstream branch [software/conftool] - 10https://gerrit.wikimedia.org/r/414694 (owner: 10Hashar) [15:52:26] (03CR) 10Hashar: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/414694 (owner: 10Hashar) [15:52:56] (03PS4) 10BBlack: interface::rps: change IRQ count without reboot [puppet] - 10https://gerrit.wikimedia.org/r/414676 [15:53:00] (03PS1) 10BBlack: numa_networking: remove "isolate" experiment [puppet] - 10https://gerrit.wikimedia.org/r/414697 [15:53:53] (03CR) 10jerkins-bot: [V: 04-1] Tweak gbp to use 'master' has the upstream branch [software/conftool] - 10https://gerrit.wikimedia.org/r/414694 (owner: 10Hashar) [15:55:30] (03Abandoned) 10Dzahn: webserver_misc_apps: remove kafka related includes [puppet] - 10https://gerrit.wikimedia.org/r/413673 (https://phabricator.wikimedia.org/T187805) (owner: 10Dzahn) [15:55:31] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.618 second response time [15:56:31] (03PS2) 10Andrew Bogott: wikitech: grants for the new labswiki db 
on m5 [puppet] - 10https://gerrit.wikimedia.org/r/413884 (https://phabricator.wikimedia.org/T188029) [15:58:41] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:00:53] mobrovac: is pdfrender under your umbrella? it is flopping [16:00:58] the one on 1004 [16:01:05] (03Draft1) 10Paladox: Add build documentation on building the plugin [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414698 [16:01:06] (03PS2) 10Paladox: Add build documentation on building the plugin [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414698 [16:01:22] 10Operations, 10ops-codfw: db2049 management unable to login via ssh - https://phabricator.wikimedia.org/T187534#4001687 (10Papaul) @Marostegui can you please depool the system for me? Thanks [16:01:47] PROBLEM - MariaDB disk space on silver is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=57%) [16:01:54] ah [16:01:58] andrewbogott: ^ [16:02:07] andrewbogott: did you try to do a local backup? 
[16:02:08] yep, I'm on it [16:02:48] RECOVERY - MariaDB disk space on silver is OK: DISK OK [16:02:51] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.674 second response time [16:03:33] (03PS2) 10Dzahn: lists: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409480 [16:03:36] (03PS1) 10Marostegui: db-codfw.php: Depool db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414699 (https://phabricator.wikimedia.org/T187534) [16:04:18] (03CR) 10BBlack: [C: 032] "PCC says all-ok here as expect (functional no-op, no hosts are currently configured with "isolate")" [puppet] - 10https://gerrit.wikimedia.org/r/414697 (owner: 10BBlack) [16:04:31] RECOVERY - puppet last run on actinium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:04:52] (03PS1) 10Subramanya Sastry: Enable RemexHtml on all wikinews wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414700 (https://phabricator.wikimedia.org/T188000) [16:04:54] (03PS1) 10Subramanya Sastry: Enable RemexHtml on all private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414701 [16:04:56] (03PS1) 10Subramanya Sastry: Enable RemexHtml on a few miscellaneous wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414702 [16:04:59] (03PS3) 10Dzahn: lists: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409480 [16:05:22] marostegui can you make the failed/failed ssd blink on db1111 [16:05:35] cmjohnson1: let me seeee [16:05:38] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414699 (https://phabricator.wikimedia.org/T187534) (owner: 10Marostegui) [16:05:51] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:06:22] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4001762 (10fgiunchedi) 
I mocked some configuration values and installed mariadb on `puppetmaster-filippo-stretch2` to test `servermon.rb` reporte... [16:07:21] (03CR) 10Zoranzoki21: "recheck Jenkins ill" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414401 (https://phabricator.wikimedia.org/T188203) (owner: 10Zoranzoki21) [16:07:24] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414699 (https://phabricator.wikimedia.org/T187534) (owner: 10Marostegui) [16:07:34] (03CR) 10jenkins-bot: db-codfw.php: Depool db2049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414699 (https://phabricator.wikimedia.org/T187534) (owner: 10Marostegui) [16:07:38] (03CR) 10Zoranzoki21: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414401 (https://phabricator.wikimedia.org/T188203) (owner: 10Zoranzoki21) [16:08:24] (03PS4) 10Dzahn: lists: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409480 [16:08:47] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2049 - T187534 (duration: 00m 56s) [16:09:19] !log Stop MySQL db2049 to get its mgmt network fixed - T187534 [16:09:44] (03PS1) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414704 [16:11:00] (03CR) 10Alexandros Kosiaris: "Well.. the "special" part was me back in Icd266ac5f1c0edd40d07de041be90422f8003daf. 
I specifically wanted to create 2 sets of hosts (monit" [puppet] - 10https://gerrit.wikimedia.org/r/413889 (https://phabricator.wikimedia.org/T187805) (owner: 10Dzahn) [16:11:23] cmjohnson1: let me know if you see it blinking now [16:11:46] (03PS2) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414704 (https://phabricator.wikimedia.org/T188129) [16:11:54] !log Poweroff db2049 for maintenance - T187534 [16:12:13] i see it [16:12:14] 10Operations, 10ops-codfw, 10Patch-For-Review: db2049 management unable to login via ssh - https://phabricator.wikimedia.org/T187534#4001815 (10Marostegui)  @papaul db2049 is now off [16:12:21] ugh, stashbot left? [16:12:21] cmjohnson1: cool! [16:12:30] bd808: ^ [16:12:34] jouncebot, now [16:12:34] No deployments scheduled for the next 1 hour(s) and 47 minute(s) [16:12:46] !log replacing disk slot 5 db1111 [16:14:01] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.666 second response time [16:15:52] 10Operations, 10ops-codfw, 10Patch-For-Review: db2049 management unable to login via ssh - https://phabricator.wikimedia.org/T187534#4001817 (10Papaul) Thanks [16:16:27] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/10142/fermium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/409480 (owner: 10Dzahn) [16:16:51] 10Operations, 10ops-eqiad, 10DBA: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#4001826 (10Cmjohnson) The ssd was replaced, @marostegui please confirm and resolve after rebuild Return shipping informaitn USPS 9202 3946 5301 2438 0714 10 FEDEX 961191... [16:17:02] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4001827 (10Paladox) @fgiunchedi could that be the heap? 
https://stackoverflow.com/questions/20297524/c-free-invalid-pointer [16:17:11] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:20:13] 10Operations, 10ops-eqiad, 10DBA: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#4001835 (10Marostegui) @Cmjohnson looks like storage crashed and the FS became read-only. We are investigating why... [16:20:23] !log no logging [16:24:24] 10Operations, 10ops-eqiad, 10DBA: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#4001842 (10Marostegui) This is all we have from the HW logs: ``` /admin1/system1/logs1/log1-> show record3 properties CreationTimestamp = 20180226161220.000000-360... [16:24:52] PROBLEM - Host db2049.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:25:37] restarting stashbot on toolforge [16:26:02] !log restarted stashbot on toolforge because it didn't react to !log [16:26:30] 10Operations, 10Ops-Access-Requests, 10Analytics-Kanban, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802#4001844 (10elukey) @HaeB Hi! Do you still need these perms or can we roll them back? [16:26:42] !log test !log [16:26:45] ... [16:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:14] mutante: thanks! 
[16:27:20] (03PS2) 10Filippo Giunchedi: WIP ruby-mysql2 [puppet] - 10https://gerrit.wikimedia.org/r/414675 (https://phabricator.wikimedia.org/T184562) [16:27:20] (03PS1) 10Filippo Giunchedi: hieradata: depool rhodium [puppet] - 10https://gerrit.wikimedia.org/r/414706 (https://phabricator.wikimedia.org/T184562) [16:27:22] (03PS1) 10Filippo Giunchedi: install_server: reinstall rhodium with Stretch [puppet] - 10https://gerrit.wikimedia.org/r/414707 (https://phabricator.wikimedia.org/T184562) [16:27:22] godog: but it did not fix it :/ [16:27:33] godog: oh, now it did :) [16:27:46] mutante: yeah I think it might be related to wikitech being ro [16:27:49] i was already in "./bin/stashbot.sh tail" heh [16:27:57] found the docs on wikitech [16:28:28] dzahn user could not do it, but root could "become stashbot" [16:29:11] !log restarted stashbot on toolforge because it didn't react to !log [16:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:02] RECOVERY - Host db2049.mgmt is UP: PING WARNING - Packet loss = 44%, RTA = 36.99 ms [16:31:12] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.679 second response time [16:31:36] !log Maintenance: removing Msw-d4-codfw for replacement:T187534 [16:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:51] T187534: db2049 management unable to login via ssh - https://phabricator.wikimedia.org/T187534 [16:33:23] (03PS1) 10Rush: openstack: labtestcontrol2003 to jessie [puppet] - 10https://gerrit.wikimedia.org/r/414708 (https://phabricator.wikimedia.org/T188266) [16:33:43] !log Reboot db1111 storage crashed - T187526 [16:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:58] T187526: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526 [16:34:21] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:38] 
(03CR) 10Rush: [C: 032] openstack: labtestcontrol2003 to jessie [puppet] - 10https://gerrit.wikimedia.org/r/414708 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [16:35:52] PROBLEM - Host ps1-d4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:36:11] PROBLEM - Host ores2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:36:18] (03CR) 10Elukey: "As FYI zookeeper ferm rules are already in hiera: https://gerrit.wikimedia.org/r/#/c/413685/2/hieradata/role/common/configcluster.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/413889 (https://phabricator.wikimedia.org/T187805) (owner: 10Dzahn) [16:39:22] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.735 second response time [16:39:40] !log making wikitech read-only (via a local patch) while I migrate the database to m5 [16:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:01] RECOVERY - Host ps1-d4-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.97 ms [16:41:21] RECOVERY - Host ores2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.93 ms [16:42:31] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:43:21] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 276 bytes in 0.610 second response time [16:46:32] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:03] (03PS2) 10Herron: WIP: puppet_compiler: add support for puppetdb4 and local postgresql [puppet] - 10https://gerrit.wikimedia.org/r/413881 [16:47:40] (03PS1) 10Vgutierrez: Provide testing for FSM.BGPTimer [debs/pybal] - 10https://gerrit.wikimedia.org/r/414711 (https://phabricator.wikimedia.org/T188085) [16:47:44] (03CR) 10jerkins-bot: [V: 04-1] WIP: puppet_compiler: add support for puppetdb4 and local postgresql [puppet] - 10https://gerrit.wikimedia.org/r/413881 (owner: 10Herron) [16:50:17] (03PS2) 10Giuseppe Lavagetto: Add the --hostname switch to simple node actions. 
[software/conftool] - 10https://gerrit.wikimedia.org/r/414669 [16:50:19] (03PS2) 10Giuseppe Lavagetto: Make full path of the object seen in the output for any change in SetAction and EditAction [software/conftool] - 10https://gerrit.wikimedia.org/r/414670 [16:50:53] 10Operations, 10netops: cr1-eqsin faulty interfaces - https://phabricator.wikimedia.org/T187807#4001998 (10ayounsi) Most recent update was: > We are pushing the delivery by next week if there everything is smooth and no customs clearance issue. Send on the 24th. Still asking for a more accurate ETA. [16:51:28] (03CR) 10jerkins-bot: [V: 04-1] Add the --hostname switch to simple node actions. [software/conftool] - 10https://gerrit.wikimedia.org/r/414669 (owner: 10Giuseppe Lavagetto) [16:51:50] (03CR) 10jerkins-bot: [V: 04-1] Make full path of the object seen in the output for any change in SetAction and EditAction [software/conftool] - 10https://gerrit.wikimedia.org/r/414670 (owner: 10Giuseppe Lavagetto) [16:51:53] 10Operations, 10ops-codfw, 10netops: codfw: mgmt switch replacement in D4 - https://phabricator.wikimedia.org/T187816#4002005 (10Papaul) 05Open>03Resolved Switch replacement complete. - Racktables update - Test serial console of all 3 servers connected to msw-d4-codfw Resolving this task. [16:52:20] 10Operations, 10ops-codfw, 10Patch-For-Review: db2049 management unable to login via ssh - https://phabricator.wikimedia.org/T187534#4002008 (10Marostegui) Looks like this is back to life: ``` root@db2049.mgmt.codfw.wmnet's password: User:root logged-in to ILO2M245205HN.(10.193.1.99 / FE80::FE15:B4FF:FE92:E... 
[16:54:29] 10Operations: replace bast1001 (new hardware) - https://phabricator.wikimedia.org/T183412#4002028 (10Dzahn) p:05Triage>03High doing this now with hi prio ( using misc system from T184480#3938314) [16:57:41] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.276 second response time [16:58:27] (03PS1) 10Giuseppe Lavagetto: Release 1.0.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/414715 [16:59:40] (03CR) 10jerkins-bot: [V: 04-1] Release 1.0.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/414715 (owner: 10Giuseppe Lavagetto) [17:00:29] (03PS1) 10Vgutierrez: Provide unique logging for BGP instances [debs/pybal] - 10https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085) [17:00:44] (03CR) 10Giuseppe Lavagetto: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/414715 (owner: 10Giuseppe Lavagetto) [17:01:52] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:01:52] 10Operations, 10ops-eqiad, 10DBA: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#4002101 (10Marostegui) Looks like the new disk has not been added automatically to the RAID. I have been digging around the PERC menu, but it is terribly slow from here,... 
[17:04:03] (03CR) 10Jforrester: [C: 031] Enable RemexHtml on all wikinews wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414700 (https://phabricator.wikimedia.org/T188000) (owner: 10Subramanya Sastry) [17:04:18] (03CR) 10Jforrester: [C: 031] Enable RemexHtml on all private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414701 (owner: 10Subramanya Sastry) [17:05:12] (03CR) 10Jforrester: [C: 031] Enable RemexHtml on a few miscellaneous wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414702 (owner: 10Subramanya Sastry) [17:07:17] 10Operations, 10ops-codfw, 10Patch-For-Review: db2049 management unable to login via ssh - https://phabricator.wikimedia.org/T187534#4002144 (10Papaul) 05Open>03Resolved - Power drain server - Reset ILO Server is back up. [17:09:02] 10Operations: Remove 'moodbar-admin' from 'staff' global group - https://phabricator.wikimedia.org/T188278#4002152 (10MarcoAurelio) [17:13:38] (03CR) 10Andrew Bogott: "I've manually applied these changes on db1009" [puppet] - 10https://gerrit.wikimedia.org/r/413884 (https://phabricator.wikimedia.org/T188029) (owner: 10Andrew Bogott) [17:14:11] PROBLEM - HHVM jobrunner on mw1305 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [17:15:01] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 2 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#4002197 (10mmodell) Thanks @MoritzMuehlenhoff! Please let... [17:15:11] RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [17:16:19] 10Operations: Remove 'moodbar-admin' from 'staff' global group - https://phabricator.wikimedia.org/T188278#4002199 (10MarcoAurelio) Also, `sendemail-new-users` was recently added to the steward group, but it is not listed as an avalaible permission anymore. 
Perhaps it should be taken away too. [17:17:19] (03CR) 10Alexandros Kosiaris: "FWIW rhodium is old enough to warrant refresh. At the same time we are doing quite well perf wise so to even justify doing without it. But" [puppet] - 10https://gerrit.wikimedia.org/r/414707 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [17:20:38] jynus: hi, do you know when T188048 might go out? [17:20:39] T188048: Deploy ReadingLists schema change for efficient count(*) handling - https://phabricator.wikimedia.org/T188048 [17:21:15] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085#4002221 (10Vgutierrez) As suggested by @mark, https://gerrit.wikimedia.org/r/414716 provides unique logging for bgp.py classes based o... [17:25:07] tgr: on a meeting and a possible outage [17:25:18] ask on whatever ticket is for an estimation [17:26:15] (03CR) 10Bstorm: "Would it be possible to use random.choice instead of [0]? I haven't looked at the data structure, but that seems like it would work here." 
[puppet] - 10https://gerrit.wikimedia.org/r/414657 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [17:30:12] 10Operations, 10Continuous-Integration-Infrastructure, 10Goal, 10Release-Engineering-Team (Watching / External), and 2 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759#4002278 (10greg) [17:30:21] 10Operations, 10Gerrit, 10Release-Engineering-Team (Watching / External): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086#4002280 (10greg) [17:32:23] !log shutdown sca1004 on ganeti1005 for T181121 [17:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:43] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [17:32:54] (03CR) 10Arturo Borrero Gonzalez: "> Would it be possible to use random.choice instead of [0]? I" [puppet] - 10https://gerrit.wikimedia.org/r/414657 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [17:34:12] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:34:43] (03CR) 10Rush: "would it be possible to grab the non-floating-ip assigned instance in an HA pair for this pool I wonder?" 
[puppet] - 10https://gerrit.wikimedia.org/r/414657 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [17:34:51] !log deploying new query killer to db1109 [17:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:31] 10Operations, 10Mobile-Content-Service, 10ORES, 10Reading-Infrastructure-Team-Backlog, and 2 others: Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#4002357 (10awight) 05Open>03Resolved a:03awight We've moved to a dedicated cluster—the best possible way to limit resources ;-) [17:36:42] PROBLEM - HP RAID on db2048 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:1 - Controller: OK - Battery/Capacitor: OK [17:36:43] ACKNOWLEDGEMENT - HP RAID on db2048 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:1 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T188286 [17:36:51] 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4002363 (10ops-monitoring-bot) [17:37:23] (03PS1) 10Chad: Revert "Enable caching of constraint check results" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414720 [17:39:23] (03PS2) 10Chad: Revert "Enable caching of constraint check results" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414720 [17:39:30] (03CR) 10Chad: [V: 032 C: 032] Revert "Enable caching of constraint check results" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414720 (owner: 10Chad) [17:39:31] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.846 second response time [17:39:48] (03CR) 10Bstorm: ">" [puppet] - 10https://gerrit.wikimedia.org/r/414657 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [17:39:50] (03CR) 
10jenkins-bot: Revert "Enable caching of constraint check results" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414720 (owner: 10Chad) [17:39:56] (03CR) 10Lucas Werkmeister (WMDE): [C: 031] "Doesn’t hurt too much even if it’s unrelated – it just means some API requests will take longer again, just as they did before today’s SWA" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414720 (owner: 10Chad) [17:40:34] 10Operations, 10ops-codfw, 10DBA: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#4002381 (10Papaul) a:05Papaul>03Marostegui Disk placement complete. [17:40:48] (03CR) 10Arturo Borrero Gonzalez: "> would it be possible to grab the non-floating-ip assigned instance" [puppet] - 10https://gerrit.wikimedia.org/r/414657 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [17:41:02] !log demon@tin Synchronized wmf-config/Wikibase-production.php: Revert "Enable caching of constraint check results" (duration: 00m 57s) [17:41:05] demon@tin: Failed to log message to wiki. Somebody should check the error logs. [17:41:05] marostegui, jynus ^^^^ [17:41:14] hu, failed to log? [17:41:16] thanks [17:41:17] no_justification: is it deployed then? 
[17:41:23] greg-g: yeah, !log is under maintenance [17:41:27] gotcha [17:41:29] I am going to kill gain [17:41:37] to see they don't come back [17:41:42] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 22 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [17:41:58] if someone else can check errors also go down [17:41:58] jynus: cool, I can see some small downs on the graph, but those are from the killing probably [17:42:02] I am [17:44:32] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:00] things seem under control [17:45:17] I am almost sure the revert fixed it [17:45:23] yes [17:45:28] or, in other words, the patch was the cause [17:45:32] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.945 second response time [17:47:06] okay, so now I need to figure out what the hell my code was doing wrong… [17:47:15] :) [17:48:25] breaking wikidata :-) [17:48:30] xddd [17:48:34] you should communicate with users [17:48:41] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:42] and explain there was some slowdown [17:48:48] but I wasn’t the one who reverted the change, so I don’t deserve a T-shirt? 
;) [17:48:53] maybe an incident report [17:49:13] the slowdown was slow to buildup [17:49:28] so it was not detected by monitoring inmediatelyu [17:50:46] Lucas_WMDE: this is the best way to see it: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=11&fullscreen&orgId=1&from=1519592991194&to=1519667375278&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All [17:51:07] oh dear lord [17:51:12] 24-second extra latency while normally it should be 200 ms [17:51:39] Lucas_WMDE: and this was the network used on an affected server: https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=10&fullscreen&orgId=1&var-server=db1109&var-network=eth0&from=now-6h&to=now&refresh=1m [17:51:41] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 11 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [17:51:54] marostegui: actually it was affecting all servers [17:52:04] jynus: yeah, it was an example of a server :) [17:52:10] ah, ok [17:52:22] a prometheus alert would be nice [17:52:30] I really don’t understand how this could happen, though… [17:53:41] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.496 second response time [17:56:27] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Give 'sudo -u yarn' asccess to joal on analytics-hadoop-workers nodes - https://phabricator.wikimedia.org/T187723#4002456 (10RobH) >>! In T187723#3983455, @elukey wrote: > I support the request, and it might be wise to allow this simple diff for the wh... [17:56:47] (03PS2) 10RobH: admin::data: allow analytics-admins to sudo as yarn [puppet] - 10https://gerrit.wikimedia.org/r/412704 (https://phabricator.wikimedia.org/T187723) (owner: 10Elukey) [17:56:51] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:00] elukey: did you wanna merge your patchset for yarn sudo? 
[17:57:11] (can wait post meeting but didnt wanna let it sit abandoned) [17:57:41] robh: sure I can do it now [17:57:54] !log mobrovac@tin Started restart [electron-render/deploy@94d27d7]: Stuck, restart - T174916 [17:57:57] mobrovac@tin: Failed to log message to wiki. Somebody should check the error logs. [17:57:58] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916 [17:58:04] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T188187#4002471 (10Marostegui) The rebuilt failed for this disk, I guess this disk was not in a good state: ``` PD: 0 Information Enclosure Device ID: 32 Slot Number: 2 Drive's position: DiskGroup: 0, Span: 1, Arm:... [17:58:35] (03CR) 10Elukey: [C: 032] admin::data: allow analytics-admins to sudo as yarn [puppet] - 10https://gerrit.wikimedia.org/r/412704 (https://phabricator.wikimedia.org/T187723) (owner: 10Elukey) [17:58:51] cool [17:59:41] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [18:00:04] gehel: #bothumor I ♥ Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180226T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:06] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Give 'sudo -u yarn' asccess to joal on analytics-hadoop-workers nodes - https://phabricator.wikimedia.org/T187723#4002487 (10elukey) 05Open>03Resolved [18:00:27] jouncebot: Deploy new Updater, new GUI and new whitelist.txt coming up... 
[18:02:10] (03CR) 10Chad: [V: 032 C: 032] Add build documentation on building the plugin [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/414698 (owner: 10Paladox) [18:02:49] (03CR) 10Chad: [C: 032] ExtensionDistributor: Ignore empty repositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414612 (owner: 10Legoktm) [18:04:20] (03Merged) 10jenkins-bot: ExtensionDistributor: Ignore empty repositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414612 (owner: 10Legoktm) [18:04:24] 10Operations, 10hardware-requests: eqiad/codfw: (4)+(4) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#4002505 (10brion) Nice. :D [18:05:21] !log gehel@tin Started deploy [wdqs/wdqs@4edbbaa]: new update, GUI and whitelist.txt [18:05:23] gehel@tin: Failed to log message to wiki. Somebody should check the error logs. [18:06:57] (03CR) 10jenkins-bot: ExtensionDistributor: Ignore empty repositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414612 (owner: 10Legoktm) [18:08:37] (03PS1) 10Awight: Enable Extension:JADE on all beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414729 (https://phabricator.wikimedia.org/T176333) [18:10:05] !log gehel@tin Finished deploy [wdqs/wdqs@4edbbaa]: new update, GUI and whitelist.txt (duration: 04m 44s) [18:10:08] gehel@tin: Failed to log message to wiki. Somebody should check the error logs. [18:10:29] anyone knows which error logs to check for stashbot? [18:11:17] (03PS1) 10Greg Grossmeier: beta: add fr.wikipedia for LE cert [puppet] - 10https://gerrit.wikimedia.org/r/414730 (https://phabricator.wikimedia.org/T188288) [18:11:27] SMalyshev: deployment completed, tests are green (except wdqs1004, still down - T188045) [18:11:28] T188045: wdqs1004 broken - https://phabricator.wikimedia.org/T188045 [18:11:47] gehel: 17:41:23 marostegui | greg-g: yeah, !log is under maintenance [18:12:06] Oh, I missed that one... [18:12:13] greg-g: thanks! 
[18:12:14] marostegui: who's maintenance'ing logs/stashbot? [18:12:32] greg-g: andrewbogott is working on wikitech and it is set to read_only [18:13:02] I hope to have it back soon but there are a fair number of unknowns [18:13:06] marostegui: ahhhhhh [18:13:11] andrewbogott: godspeed [18:13:51] !log demon@tin Synchronized wmf-config/CommonSettings.php: ExtensionDistributor: Ignore empty repositories (duration: 00m 56s) [18:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:39] (03PS5) 10Dzahn: lists: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409480 [18:16:45] (03CR) 10Greg Grossmeier: "Is this all that needs to be done to get a LE cert for a domain in Beta Cluster?" [puppet] - 10https://gerrit.wikimedia.org/r/414730 (https://phabricator.wikimedia.org/T188288) (owner: 10Greg Grossmeier) [18:18:31] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [18:20:42] (03PS1) 10Andrew Bogott: wikitech: use 'labswiki' database on m5-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414733 (https://phabricator.wikimedia.org/T188029) [18:22:23] (03CR) 10Dzahn: [C: 031] "wasn't involved in setting this up, but after looking at code.. looks like this is all, yea" [puppet] - 10https://gerrit.wikimedia.org/r/414730 (https://phabricator.wikimedia.org/T188288) (owner: 10Greg Grossmeier) [18:22:32] (03CR) 10Dzahn: [C: 032] beta: add fr.wikipedia for LE cert [puppet] - 10https://gerrit.wikimedia.org/r/414730 (https://phabricator.wikimedia.org/T188288) (owner: 10Greg Grossmeier) [18:23:45] thanks mutante :) [18:24:50] greg-g: you're welcome. .. 
when logging in on that machine i see though that last puppet run as 19072 minutes ago [18:24:56] mutante: added you because of https://gerrit.wikimedia.org/r/c/386077 (you +2'd it) :) [18:24:58] !log gehel@tin Started deploy [wdqs/wdqs@f74cbd1]: new forAllCategoryWikis.sh [18:25:03] mutante: ugh [18:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:42] yea, there is usually some other issue [18:25:51] but means we cant apply it [18:26:08] "not find data item profile::cache::kafka::webrequest::kafka_cluster_name in any Hiera data file" [18:27:42] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4002694 (10Gehel) a:03Gehel Hardware diagnostic is running, I'll report back with the results when completed. [18:31:27] !log gehel@tin Finished deploy [wdqs/wdqs@f74cbd1]: new forAllCategoryWikis.sh (duration: 06m 28s) [18:31:38] SMalyshev: ^ done ! [18:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:42] (03CR) 10Rush: "while it should be possible to know the floating ip in these cases before hand and to verify which instance it is assigned to during dynam" [puppet] - 10https://gerrit.wikimedia.org/r/414657 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [18:36:14] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: Activate kafka-based recent change poller for wikidata query service - https://phabricator.wikimedia.org/T188252#4002716 (10Smalyshev) p:05Triage>03Normal [18:44:14] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4002756 (10RobH) p:05Triage>03Normal [18:49:38] (03PS1) 10Dzahn: deployment-prep: set profile::cache::kafka::webrequest::kafka_cluster_name [puppet] - 10https://gerrit.wikimedia.org/r/414738 (https://phabricator.wikimedia.org/T188288) [18:50:04] (03PS5) 10BBlack: 
rps: change IRQs without reboot on bnx2x [puppet] - 10https://gerrit.wikimedia.org/r/414676 [18:50:06] (03PS1) 10BBlack: Add net_driver fact [puppet] - 10https://gerrit.wikimedia.org/r/414739 [18:50:08] (03PS1) 10BBlack: lvs - use new fact to determine bnx2x [puppet] - 10https://gerrit.wikimedia.org/r/414740 [18:51:00] (03PS2) 10Dzahn: deployment-prep: set profile::cache::kafka::webrequest::kafka_cluster_name [puppet] - 10https://gerrit.wikimedia.org/r/414738 (https://phabricator.wikimedia.org/T188288) [18:51:23] (03CR) 10Dzahn: [C: 032] deployment-prep: set profile::cache::kafka::webrequest::kafka_cluster_name [puppet] - 10https://gerrit.wikimedia.org/r/414738 (https://phabricator.wikimedia.org/T188288) (owner: 10Dzahn) [18:51:48] greg-g / : T184812 no longer needs to be UBN!, right? you reverted the change, everything fine again for now. (and it’s a config change, so we don’t need to worry about the next train reintroducing the issue either.) [18:51:48] T184812: Enable constraint result caching on Wikidata - https://phabricator.wikimedia.org/T184812 [18:52:07] oops, IRC client dropped the mention – marostegui ^ [18:52:18] Lucas_WMDE: right [18:52:24] ok thanks [18:52:34] updated [18:54:19] PROBLEM - mysqld processes on silver is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:54:46] !log disabling puppet agents and rebooting codfw puppet masters for kernel update [18:54:51] ^ should be known, apologies. andrewbogott ^^ silver [18:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:22] * andrewbogott fights with icinga to mute [18:59:30] * James_F is here, for when the bot pings. [19:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180226T1900). Please do the needful. 
[19:00:04] James_F, Zoranzoki21, and Urbanecm: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:01:16] James_F: I'm going to want to deploy my change either last or not at all… will update in a few [19:01:22] !log codfw puppet master kernel updates complete — re-enabling puppet agents [19:01:25] herron: Failed to log message to wiki. Somebody should check the error logs. [19:01:26] OK. [19:01:38] um… also of course my change isn't on wikitech because it's read-only :) [19:01:43] !log -- [19:01:45] herron: Failed to log message to wiki. Somebody should check the error logs. [19:01:55] !log codfw puppet master kernel updates complete re-enabling puppet agents [19:01:57] herron: Failed to log message to wiki. Somebody should check the error logs. [19:02:56] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3867914 (10faidon) So for some reason (WMCS bad luck!), these seem to have been ordered with Intel NIC daughter cards. We have had Intel NICs only in the distant past, 99% of our 10G f... [19:03:14] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4002878 (10faidon) a:05Cmjohnson>03RobH [19:04:19] Is anyone doing the SWAT or should I do it? [19:05:07] RoanKattouw: Could you? [19:05:37] RECOVERY - mysqld processes on silver is OK: PROCS OK: 1 process with command name mysqld [19:06:19] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4002884 (10elukey) > OS Version: Existing hadoop worker nodes use Jessie. Can these new hosts be stretch? We still haven't tested Hadoop packages on stretch... 
[19:06:35] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4002886 (10elukey) [19:06:42] James_F: sorry for the commotion; I've rolled back my changes for the moment and you should be able to log normally for now [19:06:59] OK. [19:07:56] James_F: So help me understand your config patch [19:08:05] !log codfw puppet master kernel updates complete re-enabling puppet agents [19:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:28] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4002892 (10RobH) [19:09:19] RoanKattouw: Stack of patches, starting with that, then once that's deployed, https://gerrit.wikimedia.org/r/#/c/413656/ will go out in the train. [19:09:40] RoanKattouw: Changing the meaning of wgVisualEditorEnableWikitext and adding wgVisualEditorEnableWikitextBetaFeature. [19:09:58] (03CR) 10Dzahn: [C: 032] "this remove one error but due to:" [puppet] - 10https://gerrit.wikimedia.org/r/414738 (https://phabricator.wikimedia.org/T188288) (owner: 10Dzahn) [19:11:01] (03PS6) 10Dzahn: lists: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409480 [19:11:56] (03PS4) 10Catrope: Add mushroomobserver.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414401 (https://phabricator.wikimedia.org/T188203) (owner: 10Zoranzoki21) [19:12:01] (03CR) 10Catrope: [C: 032] Add mushroomobserver.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414401 (https://phabricator.wikimedia.org/T188203) (owner: 10Zoranzoki21) [19:12:31] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4002903 (10chasemp) >>! In T183937#4002875, @faidon wrote: >Any disagreements? 
Nope, please and thank you. Really appreciate you and DC Ops working through our puzzles. [19:12:35] James_F: Oh I see, you're setting a config var that doesn't exist yet [19:12:42] PROBLEM - HHVM rendering on mw2150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:47] RoanKattouw: Yeah, rather than break the world with the train. :-) [19:12:47] (03CR) 10Catrope: [C: 032] 2017 wikitext editor: Simplify config part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413651 (owner: 10Jforrester) [19:13:27] (03Merged) 10jenkins-bot: Add mushroomobserver.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414401 (https://phabricator.wikimedia.org/T188203) (owner: 10Zoranzoki21) [19:13:41] (03CR) 10jenkins-bot: Add mushroomobserver.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414401 (https://phabricator.wikimedia.org/T188203) (owner: 10Zoranzoki21) [19:13:41] RECOVERY - HHVM rendering on mw2150 is OK: HTTP OK: HTTP/1.1 200 OK - 82081 bytes in 1.060 second response time [19:14:18] (03CR) 10Catrope: [C: 032] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414704 (https://phabricator.wikimedia.org/T188129) (owner: 10Urbanecm) [19:14:27] 10Operations: TransparencyReport-private is not auto deploying - https://phabricator.wikimedia.org/T188224#4002915 (10APalmer_WMF) Thanks, everyone! We spoke with @Catrope last week, and he was able to get it working again. Is there any way to determine why it happened / make sure it doesn't happen again? [19:15:46] (03Merged) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414704 (https://phabricator.wikimedia.org/T188129) (owner: 10Urbanecm) [19:16:33] Lucas_WMDE: Are you around to verify the SWAT deployment of https://gerrit.wikimedia.org/r/#/c/414714/ ? 
[19:16:44] I am around [19:17:01] but currently writing an incident report for an outage caused by a change I had in the earlier SWAT [19:17:03] (03CR) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414704 (https://phabricator.wikimedia.org/T188129) (owner: 10Urbanecm) [19:17:08] so, just so you know… :) [19:17:18] I won’t be mad if you say “this guy can’t be trusted right now not to break the wiki” [19:17:28] I think this change is fine, but I thought so of the other one too [19:18:22] (03PS2) 10Catrope: 2017 wikitext editor: Simplify config part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413651 (owner: 10Jforrester) [19:18:29] (03CR) 10Catrope: 2017 wikitext editor: Simplify config part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413651 (owner: 10Jforrester) [19:18:32] (03CR) 10Catrope: [C: 032] 2017 wikitext editor: Simplify config part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413651 (owner: 10Jforrester) [19:19:58] haha OK no worries [19:20:00] (03Merged) 10jenkins-bot: 2017 wikitext editor: Simplify config part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413651 (owner: 10Jforrester) [19:20:17] (03CR) 10jenkins-bot: 2017 wikitext editor: Simplify config part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413651 (owner: 10Jforrester) [19:21:53] * greg-g side-eyes both of you [19:22:21] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Add mushroomobserver.org to wgCopyUploadsDomains (T188203) (duration: 00m 57s) [19:22:25] * awight puts on an extremely trustworthy expression in anticipation of the scap stare [19:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:35] T188203: Please add to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T188203 [19:23:27] I'm around in the office for a while in case things break [19:23:41] PROBLEM - High lag on wdqs1003 is CRITICAL: 
CRITICAL: 33.33% of data above the critical threshold [1800.0] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:23:49] (03CR) 10Herron: [C: 031] hieradata: depool rhodium [puppet] - 10https://gerrit.wikimedia.org/r/414706 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [19:24:06] 10Operations, 10ops-codfw: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4002973 (10RobH) p:05Triage>03Normal [19:24:28] 10Operations, 10ops-codfw: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4002987 (10RobH) Before these are racked, I'd like someone to review my racking proposal: Racking Proposal: mw systems in codfw have been racked in the #3 and #4 racks in each row. Presently, there is a bi... [19:26:51] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Someday): Add support for stretch in the phabricator puppet class - https://phabricator.wikimedia.org/T187127#4002998 (10greg) [19:26:52] !log catrope@tin Synchronized wmf-config/throttle.php: Add throttle rule (T188129) (duration: 00m 56s) [19:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:08] T188129: Request for allowance of multiple account registers from same IP for 2018-02-27 - https://phabricator.wikimedia.org/T188129 [19:28:18] RoanKattouw: could you ping me when you're done? need to deploy some config cleanups [19:28:34] Sure will do [19:29:20] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Simplify 2017 wikitext editor config (part 1) (duration: 00m 54s) [19:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:57] (03CR) 10Dzahn: [C: 032] lists: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409480 (owner: 10Dzahn) [19:32:10] (03CR) 10Dzahn: [C: 032] "no-op on fermium .. 
nothing at all" [puppet] - 10https://gerrit.wikimedia.org/r/409480 (owner: 10Dzahn) [19:32:16] RoanKattouw: Thanks. [19:35:01] (03CR) 10Herron: [C: 031] "in addition I think it would be more intuitive if all puppet masters followed the same naming convention. but for the purposes of upgradi" [puppet] - 10https://gerrit.wikimedia.org/r/414707 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [19:35:17] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4003039 (10phuedx) [19:36:35] (03PS1) 10Dzahn: lists: move httpd class to role [puppet] - 10https://gerrit.wikimedia.org/r/414748 [19:38:33] (03CR) 10Dzahn: [C: 032] "see how this was ok but did not fix the style violation of including from another module. https://gerrit.wikimedia.org/r/#/c/414748/ will" [puppet] - 10https://gerrit.wikimedia.org/r/409480 (owner: 10Dzahn) [19:39:18] (03CR) 10Dzahn: "@akosiaris since we recently talked about the location of the httpd declaration. 
this is an example why i move it to role classes" [puppet] - 10https://gerrit.wikimedia.org/r/414748 (owner: 10Dzahn) [19:40:37] (03CR) 10Dzahn: [C: 032] lists: move httpd class to role [puppet] - 10https://gerrit.wikimedia.org/r/414748 (owner: 10Dzahn) [19:40:40] Lucas_WMDE: OK, your patch is on mwdebug1002, please test [19:40:57] Sorry that it took so long, Jenkins took a while and then I got distracted and briefly forgot that I was doing the SWAT [19:41:07] no problem [19:41:11] seems to be working [19:41:16] and I’ll quickly check that no grafana boards are exploding ;) [19:41:29] (though I guess that wouldn’t happen just from mwdebug, hm) [19:42:12] (03CR) 10Dzahn: [C: 032] "still everything no-op on fermium" [puppet] - 10https://gerrit.wikimedia.org/r/414748 (owner: 10Dzahn) [19:42:42] RoanKattouw: everything okay as far as I can tell [19:42:44] * Lucas_WMDE crosses fingers [19:43:37] (03PS2) 10Dzahn: varnish: add misc director for design.wm.org -> bromine [puppet] - 10https://gerrit.wikimedia.org/r/413986 (https://phabricator.wikimedia.org/T185282) [19:44:05] Keegan: beta eswiki is locked (closed) so I doubt users could do much testing (this is in reply of your announcement on my talk; oh and fwiw I've been testing them already :) ) [19:44:29] Ha, okay, thanks. 
I'll remove it :) [19:44:53] (03CR) 10Dzahn: [C: 032] varnish: add misc director for design.wm.org -> bromine [puppet] - 10https://gerrit.wikimedia.org/r/413986 (https://phabricator.wikimedia.org/T185282) (owner: 10Dzahn) [19:45:01] Alright here goes [19:46:26] !log running puppet on cache::misc servers to add new director for design.wm [19:46:36] !log catrope@tin Synchronized php-1.31.0-wmf.22/extensions/WikibaseQualityConstraints/: T184937 (duration: 01m 03s) [19:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:54] T184937: Change wbcheckconstraints’ status parameter’s default value to cacheable value - https://phabricator.wikimedia.org/T184937 [19:54:12] (03PS1) 10Pmiazga: Enable HTML Previews on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414751 (https://phabricator.wikimedia.org/T182319) [19:56:59] (03CR) 10Jdlrobson: [C: 031] Enable HTML Previews on all wikipedias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414751 (https://phabricator.wikimedia.org/T182319) (owner: 10Pmiazga) [19:57:23] James_F: is SWAT all done or still in progress? (I'm going to break wikitech again) [19:57:57] (03PS2) 10Dzahn: xhgui: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/410620 [19:58:44] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/10143/tungsten.eqiad.wmnet/ and "delta -3"" [puppet] - 10https://gerrit.wikimedia.org/r/410620 (owner: 10Dzahn) [19:59:51] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [20:00:30] (03CR) 10Dzahn: [C: 032] "no-op on tungsten. 
https://performance.wikimedia.org/xhgui is fine" [puppet] - 10https://gerrit.wikimedia.org/r/410620 (owner: 10Dzahn) [20:00:43] (03PS1) 10Urbanecm: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414755 [20:01:03] 10Operations, 10ops-codfw: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4002973 (10Papaul) @Robh since I have rack space to covert in B3 (9-17) what about not put anything in A3 and put 7 hosts in B3 see below |rack|systems| |A4|5| |B3|7| |D3|10| |D4|10| [20:01:41] (03PS2) 10Urbanecm: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414755 (https://phabricator.wikimedia.org/T188292) [20:02:26] RoanKattouw: thanks for the deploy btw! [20:02:59] server board and mysql-aggregated still look okay as well [20:03:19] 10Operations, 10ops-codfw: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4003134 (10RobH) >>! In T188301#4003113, @Papaul wrote: > @Robh since I have rack space to covert in B3 (9-17) what about not put anything in A3 and put 7 hosts in B3 see below > |rack|systems| > |A4|5| >... [20:03:44] anybody knows who owns noc.wikimedia.org? [20:03:57] Nobody really, but whats up? [20:04:01] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install wdqs200[4-6] - https://phabricator.wikimedia.org/T187800#4003141 (10Papaul) [20:04:01] docroots are in wmf-config [20:04:04] there has been some change in URLs there and I am trying to figure out where it comes from [20:04:10] Yes, I did some stuff [20:04:11] RoanKattouw: still swatting? [20:04:12] Friday :) [20:04:14] ahh, ok, so it's wmf-config [20:04:23] andrewbogott: Sorry I'm done now [20:04:25] cc MaxSem [20:04:30] great, thanks [20:04:32] (03CR) 10Pmiazga: "All wikis were using the RestbasePlain endpoint, now we want to use the HTML endpoint. 
I'll check with services whether is possible to en" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414751 (https://phabricator.wikimedia.org/T182319) (owner: 10Pmiazga) [20:04:35] I'm going to disable !log again [20:04:50] (03PS2) 10Awight: Enable Extension:JADE on all beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414729 (https://phabricator.wikimedia.org/T176333) [20:04:52] aww [20:05:01] Ahhh, the "raw" links are still busted [20:05:04] I can fix that! [20:05:44] no_justification: so you moved dblists to subdir on noc, correct? [20:06:11] MaxSem: wanna throw https://gerrit.wikimedia.org/r/#/c/414729/ into your deployment, or just lmk when you’re done and I can deploy? [20:06:11] Yep [20:06:19] no_justification: is there some config which lists this path? it broke categories dump to wdqs pipeline, since URLs were all wrong [20:06:28] (03PS1) 10Urbanecm: Publish throttle-analyze at noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414758 (https://phabricator.wikimedia.org/T187894) [20:06:33] It's just what you see in the noc docroot, no config [20:06:42] But basically, put them in that subdirectory [20:06:51] I fixed the URL but I'd like for it not to happen again [20:06:54] noc.wm.o/conf/all.dblist -> noc.wm.o/conf/dblists/all.dblist [20:06:57] It won't [20:07:00] It was a one-time change [20:07:35] (03CR) 10Ppchelko: Enable HTML Previews on all wikipedias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414751 (https://phabricator.wikimedia.org/T182319) (owner: 10Pmiazga) [20:08:02] so no config that has that path? would appreciate an announcement next time then :) [20:08:45] mea culpa, I'll mention it next time [20:08:57] And nope, no config. 
noc.wm.o is very hacky and ad hoc [20:08:57] (03PS2) 10Dzahn: admins: Add imarlier to udp2log-users [puppet] - 10https://gerrit.wikimedia.org/r/414668 (https://phabricator.wikimedia.org/T188042) (owner: 10Muehlenhoff) [20:09:34] ok, I'll keep hardcoding it then :) [20:09:44] (03CR) 10Dzahn: [C: 031] "right group for access to oxygen as requested (and a little bit more, but seems good)" [puppet] - 10https://gerrit.wikimedia.org/r/414668 (https://phabricator.wikimedia.org/T188042) (owner: 10Muehlenhoff) [20:09:58] (03PS1) 10Chad: noc.wm.o: Stop urlencoding filenames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414759 [20:10:55] (03PS1) 10Ppchelko: [JoqbQueue] Switch refreshLinks for all but wikipedia and wiktionary. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414760 (https://phabricator.wikimedia.org/T185052) [20:11:12] 10Operations, 10Wikimedia-Site-requests: Remove 'moodbar-admin' from 'staff' global group - https://phabricator.wikimedia.org/T188278#4003163 (10Dzahn) [20:12:18] 10Operations, 10ops-codfw, 10netops: switch port configuration for wdq200[4-6] - https://phabricator.wikimedia.org/T188303#4003165 (10Papaul) p:05Triage>03Normal [20:13:48] MaxSem: If you’re not deploying yet, then I’ll jump in to SWAT my config change, shall I? [20:15:04] ^ I shall… [20:15:24] (03CR) 10Awight: [C: 032] Enable Extension:JADE on all beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414729 (https://phabricator.wikimedia.org/T176333) (owner: 10Awight) [20:16:27] 10Operations, 10Wikimedia-Site-requests: Remove 'moodbar-admin' from 'staff' global group - https://phabricator.wikimedia.org/T188278#4003216 (10demon) 05Open>03Resolved a:03demon Done. [20:16:41] RECOVERY - HP RAID on db2048 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [20:16:49] (03CR) 10Legoktm: "Has this extension been reviewed yet?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/414729 (https://phabricator.wikimedia.org/T176333) (owner: 10Awight) [20:16:57] (03Merged) 10jenkins-bot: Enable Extension:JADE on all beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414729 (https://phabricator.wikimedia.org/T176333) (owner: 10Awight) [20:17:14] 10Operations, 10Wikimedia-Site-requests: Remove 'moodbar-admin' from 'staff' global group - https://phabricator.wikimedia.org/T188278#4003236 (10demon) (Just the moodbar one, I'd like more info on the sendemail-new-users bit) [20:17:17] (03CR) 10jenkins-bot: Enable Extension:JADE on all beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414729 (https://phabricator.wikimedia.org/T176333) (owner: 10Awight) [20:17:18] awight: uhh [20:17:23] (03CR) 10Awight: [C: 032] "> Has this extension been reviewed yet?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414729 (https://phabricator.wikimedia.org/T176333) (owner: 10Awight) [20:17:34] (03Draft1) 10Paladox: gerrit: Ajust scap files (DO NOT MERGE) [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/414763 [20:17:35] awight: https://www.mediawiki.org/wiki/Review_queue#Preparing_for_deployment [20:17:36] (03Draft2) 10Paladox: gerrit: Ajust scap files (DO NOT MERGE) [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/414763 [20:17:52] (03CR) 10Legoktm: "https://www.mediawiki.org/wiki/Review_queue#Preparing_for_deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414729 (https://phabricator.wikimedia.org/T176333) (owner: 10Awight) [20:17:58] legoktm: This is just the beta cluster, is that necessary? 
[20:18:02] yes [20:18:06] kk lemme revert [20:18:15] (03PS1) 10Awight: Revert "Enable Extension:JADE on all beta cluster wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414764 [20:18:22] (03CR) 10Awight: [C: 032] Revert "Enable Extension:JADE on all beta cluster wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414764 (owner: 10Awight) [20:18:24] (03PS3) 10Paladox: gerrit: Ajust scap files (DO NOT MERGE) [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/414763 [20:18:52] legoktm: You’re right, thanks for calling that out. [20:19:47] I read “production deployment tracking task arf arf arf”, if you’re a Gary Larson fan… [20:21:02] thanks legoktm and awight :) yes, it's required :) [20:21:05] Also: please please please do not use extension-list-labs [20:21:25] Oh wait you did it right [20:21:30] (03Merged) 10jenkins-bot: Revert "Enable Extension:JADE on all beta cluster wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414764 (owner: 10Awight) [20:21:33] I reflexively was upset! 
[20:21:34] <3 [20:21:45] lol [20:21:49] 10Operations, 10ops-codfw: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4003262 (10Papaul) a:03Papaul [20:22:10] no_justification: I did notice the file, and decided I didn’t want to keep that sort of company [20:22:14] (03CR) 10Mobrovac: Enable HTML Previews on all wikipedias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414751 (https://phabricator.wikimedia.org/T182319) (owner: 10Pmiazga) [20:24:11] (03Abandoned) 10Krinkle: highlight.php: Don't use the escaped URL for the raw URL either [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413939 (owner: 10Chad) [20:24:16] (03CR) 10Krinkle: [C: 031] noc.wm.o: Stop urlencoding filenames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414759 (owner: 10Chad) [20:27:01] RECOVERY - Check systemd state on rhenium is OK: OK - running: The system is fully operational [20:27:54] awight: That file is about to be deleted ;-) [20:28:30] I also didn’t edit noc config :p [20:28:53] (03PS1) 10Chad: Move FileImporter/FileExporter to general extension setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414767 [20:30:05] PROBLEM - Check systemd state on rhenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:30:46] (03CR) 10Chad: [C: 032] noc.wm.o: Stop urlencoding filenames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414759 (owner: 10Chad) [20:32:16] (03PS1) 10Pmiazga: Enable VirtualPagePreviews events on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414769 (https://phabricator.wikimedia.org/T184793) [20:32:20] (03Merged) 10jenkins-bot: noc.wm.o: Stop urlencoding filenames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414759 (owner: 10Chad) [20:33:06] (03PS1) 10Awight: [DNM] Enable Extension:JADE on all beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414771 (https://phabricator.wikimedia.org/T176333) [20:33:18] (03PS2) 10Jdlrobson: Enable VirtualPagePreviews events on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414769 (https://phabricator.wikimedia.org/T186728) (owner: 10Pmiazga) [20:33:21] (03PS3) 10Pmiazga: beta: enable VirtualPagePreviews events on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414769 (https://phabricator.wikimedia.org/T184793) [20:33:35] (03CR) 10Awight: "Waiting for security review: T188308" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414771 (https://phabricator.wikimedia.org/T176333) (owner: 10Awight) [20:33:50] (03CR) 10Jdlrobson: [C: 031] beta: enable VirtualPagePreviews events on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414769 (https://phabricator.wikimedia.org/T184793) (owner: 10Pmiazga) [20:34:11] !log demon@tin Synchronized docroot/noc/conf/: Fix urlencoding (duration: 00m 57s) [20:34:16] demon@tin: Failed to log message to wiki. Somebody should check the error logs. 
[20:34:30] yuck [20:34:34] (03CR) 10jenkins-bot: Revert "Enable Extension:JADE on all beta cluster wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414764 (owner: 10Awight) [20:34:36] (03CR) 10jenkins-bot: noc.wm.o: Stop urlencoding filenames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414759 (owner: 10Chad) [20:35:08] Krenair: All the links work again! [20:35:09] Yay! [20:35:15] (03PS4) 10Pmiazga: beta: enable VirtualPagePreviews events on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414769 (https://phabricator.wikimedia.org/T184793) [20:37:37] 10Operations, 10Wikimedia-Incident: Detect high server load earlier – prometheus alert? - https://phabricator.wikimedia.org/T188317#4003428 (10Lucas_Werkmeister_WMDE) [20:37:51] (03CR) 10Pmiazga: "@Ppchelko, @Mobrovac - thanks for your input. For now, we decided to keep it for wikis only. Later we will have another review of all proj" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414751 (https://phabricator.wikimedia.org/T182319) (owner: 10Pmiazga) [20:43:41] (03PS1) 10Rush: openstack: groundwork for labtestn on mitaka [puppet] - 10https://gerrit.wikimedia.org/r/414773 (https://phabricator.wikimedia.org/T188266) [20:52:26] (03PS2) 10Rush: openstack: groundwork for labtestn on mitaka [puppet] - 10https://gerrit.wikimedia.org/r/414773 (https://phabricator.wikimedia.org/T188266) [20:55:11] (03PS2) 10Ppchelko: [JoqbQueue] Switch refreshLinks for all but wikipedia and wiktionary. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/414760 (https://phabricator.wikimedia.org/T185052) [20:58:47] (03PS3) 10Rush: openstack: groundwork for labtestn on mitaka [puppet] - 10https://gerrit.wikimedia.org/r/414773 (https://phabricator.wikimedia.org/T188266) [20:59:18] (03PS4) 10Rush: openstack: groundwork for labtestn on mitaka [puppet] - 10https://gerrit.wikimedia.org/r/414773 (https://phabricator.wikimedia.org/T188266) [21:00:05] cscott, arlolra, subbu, bearND, halfak, and Amir1: (Dis)respected human, time to deploy Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180226T2100). Please do the needful. [21:00:05] No GERRIT patches in the queue for this window AFAICS. [21:00:22] arlo will be doing a parsoid deploy [21:00:22] PROBLEM - HHVM rendering on mw2121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:01:12] RECOVERY - HHVM rendering on mw2121 is OK: HTTP OK: HTTP/1.1 200 OK - 82088 bytes in 0.258 second response time [21:05:37] (03CR) 10Rush: [C: 032] openstack: groundwork for labtestn on mitaka [puppet] - 10https://gerrit.wikimedia.org/r/414773 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:14:41] PROBLEM - Disk space on rhenium is CRITICAL: DISK CRITICAL - free space: / 1767 MB (3% inode=96%) [21:17:09] !log mholloway-shell@tin Started deploy [mobileapps/deploy@9970f97]: Update mobileapps to 8aa38e7 [21:17:11] mholloway-shell@tin: Failed to log message to wiki. Somebody should check the error logs. 
[21:17:41] PROBLEM - Disk space on rhenium is CRITICAL: DISK CRITICAL - free space: / 1489 MB (3% inode=96%) [21:20:02] PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [21:21:44] (03PS1) 10Legoktm: Fix $wgShellRestrictionMethod typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414830 (https://phabricator.wikimedia.org/T188039) [21:21:46] (03CR) 10Legoktm: [C: 032] Fix $wgShellRestrictionMethod typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414830 (https://phabricator.wikimedia.org/T188039) (owner: 10Legoktm) [21:23:02] (03Merged) 10jenkins-bot: Fix $wgShellRestrictionMethod typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414830 (https://phabricator.wikimedia.org/T188039) (owner: 10Legoktm) [21:23:16] (03CR) 10jenkins-bot: Fix $wgShellRestrictionMethod typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414830 (https://phabricator.wikimedia.org/T188039) (owner: 10Legoktm) [21:23:42] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@9970f97]: Update mobileapps to 8aa38e7 (duration: 06m 33s) [21:23:44] mholloway-shell@tin: Failed to log message to wiki. Somebody should check the error logs. [21:24:46] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Fix $wgShellRestrictionMethod typo - T188039 (duration: 00m 57s) [21:24:49] legoktm@tin: Failed to log message to wiki. Somebody should check the error logs. [21:25:43] (03PS1) 10Legoktm: Revert "Fix $wgShellRestrictionMethod typo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414831 [21:25:46] (03CR) 10Legoktm: [C: 032] Revert "Fix $wgShellRestrictionMethod typo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414831 (owner: 10Legoktm) [21:27:30] !log arlolra@tin Started deploy [parsoid/deploy@cf9b02e]: Updating Parsoid to 24c783c [21:27:32] arlolra@tin: Failed to log message to wiki. Somebody should check the error logs. 
[21:28:25] (03CR) 10Legoktm: [V: 032 C: 032] Revert "Fix $wgShellRestrictionMethod typo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414831 (owner: 10Legoktm) [21:28:48] arlolra, wikitech wiki is currently locked for migration. so, you will have to manually update the sal page once this is back to read-write [21:29:02] locked for *db migration [21:29:42] ah, looks like you are not the only one. [21:29:47] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Revert Fix $wgShellRestrictionMethod typo - T188039 (duration: 00m 55s) [21:29:49] legoktm@tin: Failed to log message to wiki. Somebody should check the error logs. [21:30:44] (03CR) 10jenkins-bot: Revert "Fix $wgShellRestrictionMethod typo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414831 (owner: 10Legoktm) [21:32:53] subbu: Ideally someone can grab all the missed !log entries and do it en masse :) [21:33:07] no_justification, yup. [21:33:47] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073#4003598 (10cwdent) @cmjohnson - the console pw needs changed, i will get it to you securely [21:40:42] <no_justification> Krenair: All the links work again! [21:40:42] <no_justification> Yay! [21:40:45] I think this was for Krinkle [21:40:52] Whoops, yes it was [21:40:53] Haha [21:40:57] But yay anyway? [21:40:58] :p [21:42:25] yay indeed [21:42:26] !log arlolra@tin Finished deploy [parsoid/deploy@cf9b02e]: Updating Parsoid to 24c783c (duration: 14m 57s) [21:42:29] arlolra@tin: Failed to log message to wiki. Somebody should check the error logs. [21:45:04] no_justification: nice [21:47:05] Krinkle: That killed /70/ symlinks [21:47:13] Yay for less indirection in wmf-config! 
[21:48:23] (03CR) 10Chad: [C: 032] Move FileImporter/FileExporter to general extension setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414767 (owner: 10Chad) [21:51:57] PROBLEM - Nginx local proxy to apache on mw1346 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time [21:52:53] (03CR) 10jerkins-bot: [V: 04-1] Move FileImporter/FileExporter to general extension setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414767 (owner: 10Chad) [21:52:58] RECOVERY - Nginx local proxy to apache on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.050 second response time [21:57:43] (03CR) 10Chad: [C: 032] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414767 (owner: 10Chad) [21:59:17] (03Merged) 10jenkins-bot: Move FileImporter/FileExporter to general extension setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414767 (owner: 10Chad) [21:59:28] (03CR) 10jenkins-bot: Move FileImporter/FileExporter to general extension setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414767 (owner: 10Chad) [22:00:04] bawolff and Reedy: I, the Bot under the Fountain, allow thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180226T2200). [22:00:04] No GERRIT patches in the queue for this window AFAICS. 
[22:00:57] (03PS2) 10Catrope: Enable ORES filters on simplewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395818 (https://phabricator.wikimedia.org/T182012) [22:02:55] (03PS2) 10Andrew Bogott: wikitech: use 'labswiki' database on m5-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414733 (https://phabricator.wikimedia.org/T188029) [22:03:16] !log testing the log by logging a test [22:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:30] (03PS3) 10Herron: WIP: puppet_compiler: add support for puppetdb4 and local postgresql [puppet] - 10https://gerrit.wikimedia.org/r/413881 [22:05:34] (03CR) 10jerkins-bot: [V: 04-1] WIP: puppet_compiler: add support for puppetdb4 and local postgresql [puppet] - 10https://gerrit.wikimedia.org/r/413881 (owner: 10Herron) [22:05:42] !log logging a log to test logging a log [22:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:32] !log made mysql on silver read-only, hopefully for good. T188029 [22:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:46] T188029: Move labswiki database to m5 - https://phabricator.wikimedia.org/T188029 [22:09:22] !log hotfixed mediawiki on silver to use m5-master for wikitech. This will be finalized with the merge of https://gerrit.wikimedia.org/r/#/c/414733/ [22:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:17] subbu: if you don't mind being a test subject, you can try logging things now. Worst case is I'll have to revert back to the read-only version earlier. [22:11:06] andrewbogott, ok. [22:12:13] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[python-glanceclient],Package[python-openstackclient],Package[python-designateclient] [22:13:02] andrewbogott, success .. 
https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1783718&oldid=1783711 [22:13:19] but, looks like RoanKattouw beat me to it by 8 mins! :) [22:13:50] Hm? [22:13:53] I didn't log anything? [22:13:55] (03CR) 10Herron: [C: 04-1] puppetmaster: use puppetdb-termini on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/413690 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [22:14:52] RoanKattouw: I think he was talking about the deployment schedule [22:15:33] RoanKattouw, sorry .. yes .. deployments page. i had been blocked earlier because of the db migration and andrew asked me to test whether i can get through now. [22:22:13] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:25:24] so where should "we" run sql commands for wikitech? [22:25:34] (if silver is now read-only only) [22:34:00] (03CR) 10MarcoAurelio: [C: 031] "If you want to have this merged, do not forget to add it to https://wikitech.wikimedia.org/wiki/Deployments or ask someone else to do it f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) (owner: 10Lokal Profil) [22:36:00] (03PS1) 10Rush: openstack: deal with competing version priorties on jessie [puppet] - 10https://gerrit.wikimedia.org/r/414842 (https://phabricator.wikimedia.org/T188266) [22:36:11] (03PS2) 10Rush: openstack: deal with competing version priorties on jessie [puppet] - 10https://gerrit.wikimedia.org/r/414842 (https://phabricator.wikimedia.org/T188266) [22:36:27] andrewbogott: Regarding CommonSettings.php, any reason not to commit the change to git? It should be fine to have it in an if conditional. [22:36:36] That way it won't be overwritten, nor require a lock. 
[22:37:22] (03CR) 10Rush: [C: 032] openstack: deal with competing version priorties on jessie [puppet] - 10https://gerrit.wikimedia.org/r/414842 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [22:39:46] 10Operations, 10hardware-requests: eqiad/codfw: (4)+(4) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#4003858 (10RobH) >>! In T188075#4001386, @faidon wrote: > Sounds good. Note that eqiad has 6 imagescalers (mw1293-mw1298) and codfw has 4 now ( mw2244-2245/mw2150-2151) b... [22:46:31] RECOVERY - Disk space on rhenium is OK: DISK OK [22:47:21] PROBLEM - Check systemd state on rhenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:49:43] +1 to Krinkle [22:54:22] 10Operations, 10Wikimedia-Incident: Detect high server load earlier – prometheus alert? - https://phabricator.wikimedia.org/T188317#4003947 (10Lucas_Werkmeister_WMDE) [22:54:43] 10Operations, 10Wikimedia-Incident: Detect high server load earlier – prometheus alert? - https://phabricator.wikimedia.org/T188317#4003428 (10Lucas_Werkmeister_WMDE) [22:54:50] 10Operations, 10Ops-Access-Requests: reinstate ezachte's access - https://phabricator.wikimedia.org/T188335#4003953 (10RobH) p:05Triage>03Normal [22:54:59] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#4003966 (10Dzahn) @RobH i see it's in site.pp with role spare::system. are some more of the checkboxes done meanwhile? [22:56:03] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: fileimporter/fileexporter improvements (duration: 00m 57s) [22:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:18] !log demon@tin Synchronized wmf-config/: fileimporter/fileexporter improvements (duration: 00m 58s) [22:58:38] Krinkle, no_justification, I'm not sure I understand… what CommonSettings flag are you talking about? 
[22:59:06] andrewbogott: The setting of readonly mode could've been committed to git/gerrit and deployed normally instead of local patch on silver. [22:59:12] So that it doesn't get overridden or require scap locks. [22:59:17] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#4003976 (10Dzahn) a:05RobH>03Dzahn [22:59:24] Even though I only wanted it set for a few hours? [22:59:32] andrewbogott: We make commits that last minutes :) [22:59:36] You would commit/deploy and then two hours later revert/deploy? [22:59:39] *shrug* ok [22:59:40] Yep [22:59:51] I'm trying to be polite and not merge things except via the swat process [22:59:53] andrewbogott: I don't mind, but it's for your own convenience, and predictability. [23:00:05] Given that it did.. not go as expected. [23:00:18] yeah, in retrospect that would've helped :) [23:02:42] (03PS1) 10Rush: openstack: keystone running on mitaka setup [puppet] - 10https://gerrit.wikimedia.org/r/414847 (https://phabricator.wikimedia.org/T188266) [23:03:05] (03CR) 10jerkins-bot: [V: 04-1] openstack: keystone running on mitaka setup [puppet] - 10https://gerrit.wikimedia.org/r/414847 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [23:03:50] 10Operations: setup/install deploy1001/wmf4750 - https://phabricator.wikimedia.org/T188337#4003997 (10RobH) p:05Triage>03Normal [23:06:17] (03PS1) 10Dzahn: site: turn bast1002 into a bastion host [puppet] - 10https://gerrit.wikimedia.org/r/414848 (https://phabricator.wikimedia.org/T186623) [23:08:53] Krinkle: given that 'scap lock' doesn't work and the config just updated itself on silver despite my having a lock... [23:09:02] would you consider just merging https://gerrit.wikimedia.org/r/#/c/414733/ so I can stop fighting this? [23:09:33] umm. wikitech's db seems to have gone away [23:09:52] or, no_justification, same question? [23:09:55] it's down [23:09:56] andrewbogott: I prefer not to, but Jaime or Chad might. 
I can't verify this right now. [23:10:53] andrewbogott: afaik locks can only be held on the deployment host (tin), not on clients. [23:10:54] (03PS1) 10Nfontes: Add Apache 2.0 license. [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/414851 [23:11:13] But that is admittedly something scap doesn't realise when you run it locally. [23:11:34] andrewbogott: Should mysql be restarted on silver then? same read-only as before? [23:11:42] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#4004040 (10Dzahn) i was able to login on DRAC and get a console, then i saw Build : 4239.35 ePSA Pre-boot System Assessment and shortly after the system rebooted into Debian installer There i... [23:11:56] Seems like that should happen *after* switching the mw side of reading. [23:12:14] Anyway, I'm confused and too busy in areas I shouldn't be putting my nose in :) [23:12:27] 10Operations: update hostname label and racktables for deploy1001/wmf4750 - https://phabricator.wikimedia.org/T188339#4004049 (10RobH) p:05Triage>03Normal [23:17:06] 10Operations, 10hardware-requests: hardware request for tin replacement - https://phabricator.wikimedia.org/T184481#4004076 (10RobH) [23:17:08] 10Operations: setup/install deploy1001/wmf4750 - https://phabricator.wikimedia.org/T188337#4004072 (10RobH) 05Open>03Invalid Already have tin replacement on T175288. [23:17:22] 10Operations: setup/install deploy1001/wmf4750 - https://phabricator.wikimedia.org/T188337#4004082 (10RobH) [23:17:24] 10Operations: update hostname label and racktables for deploy1001/wmf4750 - https://phabricator.wikimedia.org/T188339#4004078 (10RobH) 05Open>03Invalid Already have tin replacement on T175288. [23:18:07] 10Operations, 10hardware-requests: hardware request for tin replacement - https://phabricator.wikimedia.org/T184481#3884434 (10RobH) So it seems this was already requested on T174452, and setup is blocked for onsite work on T175288. 
[23:18:44] (03CR) 10Nfontes: "Hi everyone," [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/414851 (owner: 10Nfontes) [23:19:39] 10Operations, 10ops-eqdfw, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4004108 (10RobH) To be clear, mgmt responds, but the mgmt password doesn't work. [23:19:57] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4004112 (10RobH) [23:20:51] (03PS2) 10Rush: openstack: keystone running on mitaka setup [puppet] - 10https://gerrit.wikimedia.org/r/414847 (https://phabricator.wikimedia.org/T188266) [23:25:39] (03CR) 10BryanDavis: [C: 032] wikitech: use 'labswiki' database on m5-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414733 (https://phabricator.wikimedia.org/T188029) (owner: 10Andrew Bogott) [23:25:53] Lock is on the master, not the slaves [23:26:37] Anyway: committing is important. Local hacks can (and are) routinely overwritten [23:26:44] Only way to avoid that *for certain* is to depool it [23:27:12] (03Merged) 10jenkins-bot: wikitech: use 'labswiki' database on m5-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414733 (https://phabricator.wikimedia.org/T188029) (owner: 10Andrew Bogott) [23:27:26] (03CR) 10jenkins-bot: wikitech: use 'labswiki' database on m5-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414733 (https://phabricator.wikimedia.org/T188029) (owner: 10Andrew Bogott) [23:27:39] 10Operations, 10ops-codfw, 10netops: switch port configuration for wdq200[4-6] - https://phabricator.wikimedia.org/T188303#4004161 (10ayounsi) 05Open>03Resolved Descriptions added, enabled (for the 1 that was not already), moved to the private vlan. 
[23:27:45] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#3778593 (10awight) Do not merge patches, currently blocked on scap 3.8 dep... [23:30:31] config change seems to be the expected no-op on mwdebug1001 [23:30:57] andrewbogott: I'm going to pull the change to silver now [23:31:04] great [23:31:56] !log Pulled T188029 change to silver [23:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:11] T188029: Move labswiki database to m5 - https://phabricator.wikimedia.org/T188029 [23:32:20] andrewbogott: it's live there now. let's check it out before I sync everywhere else [23:33:34] bd808: I logged out and in and made a trivial edit with VE [23:33:48] yeah, looks good to me too. Let's do it [23:34:16] (03CR) 10Hashar: [C: 031] Add Apache 2.0 license. [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/414851 (owner: 10Nfontes) [23:34:30] !log bd808@tin Started scap: wikitech: use 'labswiki' database on m5-master (T188029) [23:34:42] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: [Blocked] Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4004179 (10awight) 05Open>03stalled [23:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:36] * bd808 wonders why scap just sat there for 30s... 
[23:35:58] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#4004188 (10Dzahn) a:05Dzahn>03RobH [23:37:21] PROBLEM - Apache HTTP on mw2130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:37:51] !log bd808@tin Finished scap: wikitech: use 'labswiki' database on m5-master (T188029) (duration: 03m 21s) [23:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:08] T188029: Move labswiki database to m5 - https://phabricator.wikimedia.org/T188029 [23:38:11] RECOVERY - Apache HTTP on mw2130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.126 second response time [23:39:06] no_justification: is "error: request has exceeded memory limit in /srv/mediawiki/php-1.31.0-wmf.22/includes/parser/StripState.php on line 137" a known thing? [23:39:22] (03CR) 10Paladox: [C: 031] "bump" [puppet] - 10https://gerrit.wikimedia.org/r/410069 (owner: 10Andrew Bogott) [23:40:06] bd808: yes https://phabricator.wikimedia.org/T187833 [23:41:34] thanks thcipriani [23:42:56] Yeah what he said [23:44:15] bd808: Aye, RE: 30s waiting in scap. That was in 'scap pull', right? [23:44:18] I've been seeing the same thing [23:44:36] the scap pull one is the cdb rebuild without it telling you what it's doing [23:44:46] Krinkle: this was in "scap sync" [23:44:49] Whenever I first use scap on a server on a given day, after it is visibly done (in terms of shell output), it still does something for 30s before returning the prompt. [23:44:50] for scap sync I imagine it is linting without telling you what it's doing [23:44:56] Subsequent syncs are much quicker though [23:45:11] But it happens after it's done in terms of output [23:45:18] Ah, okay [23:45:19] https://phabricator.wikimedia.org/T162207 [23:45:22] the cdb happens last? [23:45:23] if we could put the cdbs on a ram disk... it would be sooo much faster [23:45:37] Thanks, I'll subscribe there, no problem. 
[23:45:40] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4004238 (10RobH) a:05Cmjohnson>03RobH [23:46:21] bd808: If we could get rid of CDBs it'd be even faster [23:46:23] ;-) [23:46:38] * awight perks up [23:46:44] (03PS1) 10Catrope: Enable ORES filters on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414859 (https://phabricator.wikimedia.org/T174560) [23:46:51] no_justification: sure :) has anyone tested to see if TC cache can survive it these days? [23:46:57] YES [23:47:09] Why does everyone thing the TC isn't fixed? [23:47:15] Beating VE in #staff, burning CDB here… it’s an exciting evening [23:47:16] sweet! where's ori when you need him to roll out that then? [23:48:07] no_justification: I just by default assume that hhvm bugs are forever [23:48:50] RoanKattouw: /me points at https://sv.wikipedia.beta.wmflabs.org [23:48:56] Some aren't forever, they just come back again ;-) https://github.com/facebook/hhvm/pull/8139 [23:49:14] awight: Are you saying you want me to deploy to beta first? [23:49:23] :) it seems prudent [23:49:39] I’m not gonna pretend it’s a requirement, though [23:49:59] it’s just that… I’ve been hurt before :) [23:51:13] (03PS3) 10Dzahn: icinga: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409204 [23:55:03] (03PS1) 10Awight: Enable Swedish on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414860 (https://phabricator.wikimedia.org/T174560) [23:55:20] Looks like more of a pain than I imagined. [23:57:36] (03PS2) 10Awight: Enable Swedish and Spanish Wikibooks on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414860 (https://phabricator.wikimedia.org/T174560) [23:59:08] Am I supposed to use addwiki.php for this...