[00:09:22] Operations, Ops-Access-Requests, Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3984534 (RobH) a:katielin Assigning to @katielin until nda signing is complete. Once do...
[00:10:35] Operations, Ops-Access-Requests, Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3984536 (RobH)
[00:11:14] (PS1) Krinkle: [WIP] errorpages: Remove unused hhvm-fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114)
[02:17:08] (CR) BryanDavis: "Dropping `require => latest` is a non-trivial change. It means that when we do something like roll out a new webservice package we will ha" [puppet] - https://gerrit.wikimedia.org/r/412699 (owner: Arturo Borrero Gonzalez)
[02:24:16] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.21) (duration: 05m 50s)
[02:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:39:40] (CR) BryanDavis: [C: 1] remove role::toollabs::puppetmaster and toollabs::puppetmaster [puppet] - https://gerrit.wikimedia.org/r/411614 (https://phabricator.wikimedia.org/T182810) (owner: Andrew Bogott)
[03:04:38] (PS2) Andrew Bogott: remove role::toollabs::puppetmaster and toollabs::puppetmaster [puppet] - https://gerrit.wikimedia.org/r/411614 (https://phabricator.wikimedia.org/T182810)
[03:05:43] (CR) Andrew Bogott: [C: 2] remove role::toollabs::puppetmaster and toollabs::puppetmaster [puppet] - https://gerrit.wikimedia.org/r/411614 (https://phabricator.wikimedia.org/T182810) (owner: Andrew Bogott)
[03:15:36] (PS1) Andrew Bogott: striker::build change requirement to debian stretch [puppet] - https://gerrit.wikimedia.org/r/412836 (https://phabricator.wikimedia.org/T187743)
[03:18:56] (CR) Andrew Bogott: [C: 2] striker::build change requirement to debian stretch [puppet] - https://gerrit.wikimedia.org/r/412836 (https://phabricator.wikimedia.org/T187743) (owner: Andrew Bogott)
[03:22:38] (PS1) Krinkle: Use $wgDBname instead of IDatabase::getDBname in feed config [mediawiki-config] - https://gerrit.wikimedia.org/r/412837
[03:25:40] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 877.05 seconds
[03:28:56] (PS3) Krinkle: Remove redundant wgTemplateSandboxEditNamespaces addition [mediawiki-config] - https://gerrit.wikimedia.org/r/363531 (owner: Legoktm)
[03:29:16] (CR) Krinkle: [C: 2] Remove redundant wgTemplateSandboxEditNamespaces addition [mediawiki-config] - https://gerrit.wikimedia.org/r/363531 (owner: Legoktm)
[03:30:50] (Merged) jenkins-bot: Remove redundant wgTemplateSandboxEditNamespaces addition [mediawiki-config] - https://gerrit.wikimedia.org/r/363531 (owner: Legoktm)
[03:31:04] (CR) jenkins-bot: Remove redundant wgTemplateSandboxEditNamespaces addition [mediawiki-config] - https://gerrit.wikimedia.org/r/363531 (owner: Legoktm)
[03:32:23] (PS1) Andrew Bogott: striker::build: require libmariadbclient-dev [puppet] - https://gerrit.wikimedia.org/r/412838 (https://phabricator.wikimedia.org/T187743)
[03:37:28] (CR) Andrew Bogott: [C: 2] striker::build: require libmariadbclient-dev [puppet] - https://gerrit.wikimedia.org/r/412838 (https://phabricator.wikimedia.org/T187743) (owner: Andrew Bogott)
[03:37:31] !log It seems 'scap pull' on mwdebug1002 is acting weird (prompt doesn't return until 3-5 minutes after last line of "Finished rsync common")
[03:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:40:13] !log krinkle@tin Synchronized wmf-config/CommonSettings.php: Ie4c7879f8ac - Clean up TemplateSandboxEditNamespaces config (duration: 00m 57s)
[03:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:57:41] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 207.08 seconds
[04:22:22] (PS4) Krinkle: extract2: Set wiki context directly instead of MW_LANG indirection [mediawiki-config] - https://gerrit.wikimedia.org/r/410109
[04:22:26] (CR) Krinkle: [C: 2] extract2: Set wiki context directly instead of MW_LANG indirection [mediawiki-config] - https://gerrit.wikimedia.org/r/410109 (owner: Krinkle)
[04:24:00] (Merged) jenkins-bot: extract2: Set wiki context directly instead of MW_LANG indirection [mediawiki-config] - https://gerrit.wikimedia.org/r/410109 (owner: Krinkle)
[04:25:20] PROBLEM - HHVM rendering on mw2218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:26:11] RECOVERY - HHVM rendering on mw2218 is OK: HTTP OK: HTTP/1.1 200 OK - 74580 bytes in 0.309 second response time
[04:26:57] (PS5) Krinkle: multiversion: Remove support for MW_LANG env override [mediawiki-config] - https://gerrit.wikimedia.org/r/410110
[04:27:39] !log krinkle@tin Synchronized w/extract2.php: Ib6d77e863b - clean up MW_LANG indirection (duration: 00m 55s)
[04:27:44] (Abandoned) Krinkle: [WIP] coal: Consume EventLogging from Kafka instead of ZMQ [puppet] - https://gerrit.wikimedia.org/r/403560 (https://phabricator.wikimedia.org/T110903) (owner: Krinkle)
[04:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:35:09] (CR) jenkins-bot: extract2: Set wiki context directly instead of MW_LANG indirection [mediawiki-config] - https://gerrit.wikimedia.org/r/410109 (owner: Krinkle)
[04:48:21] (PS2) Krinkle: mw.org: remove old keys txt file from 2009 [mediawiki-config] - https://gerrit.wikimedia.org/r/411364 (owner: Chad)
[04:49:08] (CR) Krinkle: [C: 2] mw.org: remove old keys txt file from 2009 [mediawiki-config] - https://gerrit.wikimedia.org/r/411364 (owner: Chad)
[04:50:41] (Merged) jenkins-bot: mw.org: remove old keys txt file from 2009 [mediawiki-config] - https://gerrit.wikimedia.org/r/411364 (owner: Chad)
[04:50:51] (CR) jenkins-bot: mw.org: remove old keys txt file from 2009 [mediawiki-config] - https://gerrit.wikimedia.org/r/411364 (owner: Chad)
[04:52:46] Operations, monitoring, HHVM, Patch-For-Review: Monitor HHVM bytecode cache depletion on mediawiki app servers - https://phabricator.wikimedia.org/T161598#3984700 (tstarling) p:Normal>Low There's no longer a 512MB limit imposed by wfShellExec(). Monitoring bytecode cache size may be usefu...
[04:52:56] Operations, monitoring, HHVM, Patch-For-Review: Monitor HHVM bytecode cache depletion on mediawiki app servers - https://phabricator.wikimedia.org/T161598#3984704 (tstarling)
[04:53:50] (CR) Tim Starling: [C: -1] "If this is still needed it will need to be updated to reflect the fact that there is no longer a 512MB file size limit, so the main ration" [puppet] - https://gerrit.wikimedia.org/r/359120 (https://phabricator.wikimedia.org/T161598) (owner: Muehlenhoff)
[04:56:13] (CR) Krinkle: [C: 1] mw.org: Symlink keys.html to index.html [mediawiki-config] - https://gerrit.wikimedia.org/r/411367 (owner: Chad)
[04:56:27] !log krinkle@tin Synchronized docroot/mediawiki/keys/: Ie26638ed0c - rm old 2009 keys file (duration: 00m 56s)
[04:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:10] Operations, monitoring, HHVM, Patch-For-Review: Monitor HHVM bytecode cache depletion on mediawiki app servers - https://phabricator.wikimedia.org/T161598#3984713 (Krinkle)
[04:57:56] Operations, Deployments, Beta-Cluster-reproducible, HHVM, and 2 others: Switch mwscript from Zend PHP5 to default php alternative (e.g. HHVM or PHP7) - https://phabricator.wikimedia.org/T146285#3984714 (tstarling) >>! In T146285#3827389, @Joe wrote: > @tstarling AIUI we should be able to switch m...
[05:02:09] Operations, Wikimedia-General-or-Unknown: Class 'Memcached' not found when running mwscript eval.php on debug servers - https://phabricator.wikimedia.org/T150912#3984717 (Krinkle) Open>declined Closing in favour of T146285. Any hosts on which mwscript is meant to be used (maintenance hosts, dep...
[05:02:17] Operations, Wikimedia-General-or-Unknown: Class 'Memcached' not found when running mwscript eval.php on debug servers - https://phabricator.wikimedia.org/T150912#3984722 (Krinkle)
[05:02:19] Operations, Deployments, Beta-Cluster-reproducible, HHVM, and 2 others: Switch mwscript from Zend PHP5 to default php alternative (e.g. HHVM or PHP7) - https://phabricator.wikimedia.org/T146285#3984723 (Krinkle)
[05:39:40] (CR) Zoranzoki21: "rebase" [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) (owner: Zoranzoki21)
[05:40:11] (PS18) Zoranzoki21: Change namespaces on urwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393)
[05:40:30] (CR) Zoranzoki21: "I rebased because I will work on this today afternoon." [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) (owner: Zoranzoki21)
[05:54:24] (CR) Zoranzoki21: [C: 1] "Add this in wikitech.wikimedia.org/wiki/Deployments to this can be deployed on wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/412439 (https://phabricator.wikimedia.org/T187546) (owner: 星耀晨曦)
[06:17:31] (CR) Giuseppe Lavagetto: [C: 1] mediawiki: Enable auto_prepend_file setting for HHVM on Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/412827 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[06:20:57] (PS1) Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412845 (https://phabricator.wikimedia.org/T187089)
[06:23:01] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412845 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:24:27] (Merged) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412845 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:26:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1085 for alter table (duration: 00m 56s)
[06:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:42] !log Deploy schema change on db1085 (with replication - this will generate lag on labs hosts) - T187089 T185128 T153182
[06:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:56] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[06:26:56] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[06:26:56] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[06:27:21] (CR) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412845 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:38:10] Operations, ops-codfw, DBA: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3984767 (Marostegui)
[06:41:32] Operations, ops-codfw, DBA: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3984768 (Marostegui) My proposal is: Replace db2067 and db2060 (160GB) in s6 with a large host. This is the current status of s6 ``` 's6' => [ '...
[06:52:12] (PS2) Giuseppe Lavagetto: Improve debianization; change source package name. [software/conftool] - https://gerrit.wikimedia.org/r/412755
[06:58:11] !log Upgrade mariadb and kernel on db1085
[06:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:57] (Draft2) KartikMistry: Deploy Compact Language Links out of Beta on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/412696 (https://phabricator.wikimedia.org/T187677)
[07:09:26] (PS1) Marostegui: db-eqiad.php: Slowly repool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412847
[07:11:10] (CR) Marostegui: [C: 2] db-eqiad.php: Slowly repool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412847 (owner: Marostegui)
[07:12:40] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412847 (owner: Marostegui)
[07:12:53] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412847 (owner: Marostegui)
[07:13:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1085 (duration: 00m 55s)
[07:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:43] (PS1) Marostegui: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/412848 (https://phabricator.wikimedia.org/T187089)
[07:20:08] (CR) jerkins-bot: [V: -1] db-eqiad.php: Depool db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/412848 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[07:21:27] uh?
[07:22:50] (PS2) Marostegui: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/412848 (https://phabricator.wikimedia.org/T187089)
[07:24:35] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/412848 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[07:26:06] (Merged) jenkins-bot: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/412848 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[07:27:12] (CR) jenkins-bot: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/412848 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[07:27:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1096:3316 for alter table (duration: 00m 56s)
[07:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:51] !log Deploy schema change on db1096:3316 - T187089 T185128 T153182
[07:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:05] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[07:28:05] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[07:28:05] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[07:38:30] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412850
[07:38:51] (PS2) Marostegui: db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412850
[07:39:11] (CR) jerkins-bot: [V: -1] db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412850 (owner: Marostegui)
[07:41:23] Not sure what is wrong with my patch
[07:42:53] (CR) Marostegui: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/412850 (owner: Marostegui)
[07:44:48] And now it works without doing anything, weird :)
[07:45:02] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412850 (owner: Marostegui)
[07:46:35] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412850 (owner: Marostegui)
[07:47:44] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412850 (owner: Marostegui)
[07:48:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1085 (duration: 00m 55s)
[07:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:01] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412851
[08:05:17] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412851 (owner: Marostegui)
[08:05:50] PROBLEM - Host chlorine is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:20] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:21] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:21] PROBLEM - Host hassium is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:21] PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:21] PROBLEM - Host logstash1007 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:21] PROBLEM - Host netmon1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:21] PROBLEM - Host install1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:22] PROBLEM - Host dubnium is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:31] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:31] PROBLEM - Host rutherfordium is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:31] PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:48] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412851 (owner: Marostegui)
[08:07:09] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412851 (owner: Marostegui)
[08:07:20] PROBLEM - SSH on ganeti1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:07:24] hello ganeti1006
[08:08:16] I'm taking a look on console
[08:08:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1085 (duration: 01m 10s)
[08:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:08] nope, dead in the water on console afaict
[08:09:24] <_joe_> godog: powercycle it
[08:09:36] <_joe_> it's the only solution usually, it's a kernel panic
[08:09:49] !log powercycle ganeti1006
[08:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:11] RECOVERY - SSH on ganeti1006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[08:13:30] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:13:32] (CR) Giuseppe Lavagetto: [C: 2] Improve debianization; change source package name. [software/conftool] - https://gerrit.wikimedia.org/r/412755 (owner: Giuseppe Lavagetto)
[08:15:12] what was the action after a forced reboot to get the vms back up again?
[08:15:32] ah nevermind, the vms are just back now
[08:15:40] RECOVERY - Host chlorine is UP: PING OK - Packet loss = 0%, RTA = 5.96 ms
[08:15:40] RECOVERY - Host rutherfordium is UP: PING OK - Packet loss = 0%, RTA = 5.57 ms
[08:15:50] RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 7.57 ms
[08:15:50] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 8.02 ms
[08:15:50] RECOVERY - Host dubnium is UP: PING OK - Packet loss = 0%, RTA = 9.50 ms
[08:16:00] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 7.76 ms
[08:16:10] RECOVERY - Host hassium is UP: PING OK - Packet loss = 0%, RTA = 9.39 ms
[08:16:10] RECOVERY - Host logstash1007 is UP: PING OK - Packet loss = 0%, RTA = 9.14 ms
[08:16:10] RECOVERY - Host netmon1003 is UP: PING OK - Packet loss = 0%, RTA = 10.02 ms
[08:16:10] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 7.85 ms
[08:16:40] RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 9.03 ms
[08:16:50] RECOVERY - Host install1002 is UP: PING OK - Packet loss = 0%, RTA = 8.68 ms
[08:18:29] Operations, ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#3984793 (fgiunchedi) A lockup happened just now on ganeti1006, though I couldn't find any kernel messages either in the host log or syslog to confirm/deny we're seeing the same issue.
[08:19:01] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412852
[08:23:15] (CR) 星耀晨曦: "> Add this in wikitech.wikimedia.org/wiki/Deployments to this can be" [mediawiki-config] - https://gerrit.wikimedia.org/r/412439 (https://phabricator.wikimedia.org/T187546) (owner: 星耀晨曦)
[08:23:17] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412852 (owner: Marostegui)
[08:24:50] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412852 (owner: Marostegui)
[08:25:03] godog: those ganeti hangs sort themselves out (takes a few mins depending on how many VMs are on it apparently)
[08:25:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1085 (duration: 00m 55s)
[08:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:17] moritzm: right, do you know how long after it usually recovers?
[08:26:32] <_joe_> !log uploading conftool 1.0.0~beta1 to jessie
[08:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:00] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/412852 (owner: Marostegui)
[08:27:31] godog: when one of the test hosts crashed (with only one sca host) IIRC was < 2 mins, but I remember that it also took close to 10 mins in one case (when the ganeti hosts were somewhat unbalanced and the affected host had 9 VMs on it)
[08:28:31] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:28:44] ack, so in the order of 10/15 min it should recover
[08:32:04] <_joe_> !log uploading conftool 1.0.0~beta1 on stretch
[08:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:08] (PS3) Giuseppe Lavagetto: conftool::scripts: update scripts to work with conftool 1.0 [puppet] - https://gerrit.wikimedia.org/r/412673
[08:38:46] (PS4) Elukey: profile::zookeeper::server: remove explicit java-7 dependency [puppet] - https://gerrit.wikimedia.org/r/410957 (https://phabricator.wikimedia.org/T166081)
[08:41:43] (CR) Muehlenhoff: [C: 1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/410957 (https://phabricator.wikimedia.org/T166081) (owner: Elukey)
[08:43:45] (CR) Elukey: [C: 2] profile::zookeeper::server: remove explicit java-7 dependency [puppet] - https://gerrit.wikimedia.org/r/410957 (https://phabricator.wikimedia.org/T166081) (owner: Elukey)
[08:44:12] (CR) Elukey: [V: 2 C: 2] Simplify zookeeper's default template to be systemd friendly [puppet/zookeeper] - https://gerrit.wikimedia.org/r/412744 (https://phabricator.wikimedia.org/T166081) (owner: Elukey)
[08:48:10] (PS1) Elukey: Update zookeeper's module to its latest revision [puppet] - https://gerrit.wikimedia.org/r/412857 (https://phabricator.wikimedia.org/T166081)
[08:51:50] !log oblivian@puppetmaster2001 conftool action : edit; selector: scope=common
[08:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:17] <_joe_> /win 28
[08:52:23] (CR) Elukey: [C: 2] "Nothing unexpected: https://puppet-compiler.wmflabs.org/compiler02/10028/conf1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/412857 (https://phabricator.wikimedia.org/T166081) (owner: Elukey)
[08:56:50] !log oblivian@puppetmaster2001 conftool action : edit; selector: scope=codfw
[08:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:25] (PS1) Marostegui: db-eqiad.php: Fully repool db1085 and db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/412858
[08:58:28] (Abandoned) Muehlenhoff: Add Icinga check for depletion of HHVM CLI cache [puppet] - https://gerrit.wikimedia.org/r/359120 (https://phabricator.wikimedia.org/T161598) (owner: Muehlenhoff)
[08:58:31] (PS1) Elukey: profile::zookeeper::server: add the correct package name for default-jdk [puppet] - https://gerrit.wikimedia.org/r/412859 (https://phabricator.wikimedia.org/T166081)
[08:59:58] (CR) Volans: [C: -1] "I like the approach and the backward compatibility, just a couple of things to fix, see inline." (7 comments) [puppet] - https://gerrit.wikimedia.org/r/412673 (owner: Giuseppe Lavagetto)
[09:00:32] (PS1) Filippo Giunchedi: hieradata: enable SMART for lab(test) [puppet] - https://gerrit.wikimedia.org/r/412860 (https://phabricator.wikimedia.org/T86552)
[09:00:53] (CR) jerkins-bot: [V: -1] hieradata: enable SMART for lab(test) [puppet] - https://gerrit.wikimedia.org/r/412860 (https://phabricator.wikimedia.org/T86552) (owner: Filippo Giunchedi)
[09:01:14] (CR) Marostegui: [C: 2] db-eqiad.php: Fully repool db1085 and db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/412858 (owner: Marostegui)
[09:01:27] (PS2) Filippo Giunchedi: hieradata: enable SMART for lab(test) [puppet] - https://gerrit.wikimedia.org/r/412860 (https://phabricator.wikimedia.org/T86552)
[09:01:30] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[default-jdk-headless]
[09:01:53] !log oblivian@puppetmaster2001 conftool action : edit; selector: scope=codfw
[09:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:17] druid1001 is me :)
[09:02:18] !log oblivian@puppetmaster2001 conftool action : set/val=false; selector: scope=eqiad,name=ReadOnly
[09:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:32] (CR) Elukey: [C: 2] profile::zookeeper::server: add the correct package name for default-jdk [puppet] - https://gerrit.wikimedia.org/r/412859 (https://phabricator.wikimedia.org/T166081) (owner: Elukey)
[09:02:47] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1085 and db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/412858 (owner: Marostegui)
[09:03:01] (CR) jenkins-bot: db-eqiad.php: Fully repool db1085 and db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/412858 (owner: Marostegui)
[09:03:03] !log oblivian@puppetmaster2001 conftool action : set/val=false; selector: scope=eqiad,name=ReadOnly
[09:03:09] !log oblivian@puppetmaster2001 conftool action : set/val=false; selector: scope=eqiad,name=ReadOnly
[09:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1096:3316 and db1085 (duration: 00m 55s)
[09:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:24] (PS1) Marostegui: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/412861 (https://phabricator.wikimedia.org/T187089)
[09:06:30] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[09:06:52] (CR) jerkins-bot: [V: -1] db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/412861 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[09:07:53] (PS2) Marostegui: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/412861 (https://phabricator.wikimedia.org/T187089)
[09:09:32] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/412861 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[09:11:05] (Merged) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/412861 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[09:11:18] (CR) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/412861 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[09:12:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1088 for alter table (duration: 00m 55s)
[09:12:17] !log Deploy schema change on db1088 - T187089 T185128 T153182
[09:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:28] Operations, ops-codfw, DBA: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3984885 (jcrespo) I would like to focus on s4. s4 needs better hardware than s6, then do: ``` 'db2051' => 0, # B8 2.9TB 160GB, master - 'db2037' =>...
[09:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:39] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[09:12:39] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[09:12:40] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[09:13:29] Operations, ops-codfw, DBA: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3984891 (Marostegui) >>! In T187722#3984885, @jcrespo wrote: > I would like to focus on s4. s4 needs better hardware than s6, then do: I was doubting betwee...
[09:14:31] !log restart zookeeper on druid1001 (follower) to verify that the last changes are no-op
[09:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:35] !log Data checks for db2037 before removing it from s4 - T187722
[09:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:49] T187722: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722
[09:17:51] Operations, ops-codfw, DBA: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3984897 (jcrespo) We also have backups from the host that crashed itself on dbstore2001, I think, we could use them to reconstruct m5 without touching the ma...
[09:19:46] Operations, ops-codfw, DBA: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3984899 (jcrespo) >>! In T187722#3984891, @Marostegui wrote: >>>! In T187722#3984885, @jcrespo wrote: >> I would like to focus on s4. s4 needs better hardwar...
[09:22:53] Operations, ops-codfw, DBA: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3984900 (Marostegui) >>! In T187722#3984897, @jcrespo wrote: > We also have backups from the host that crashed itself on dbstore2001, I think, we could use t...
[09:26:55] (03CR) 10Giuseppe Lavagetto: conftool::scripts: update scripts to work with conftool 1.0 (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/412673 (owner: 10Giuseppe Lavagetto) [09:28:25] (03PS4) 10Giuseppe Lavagetto: conftool::scripts: update scripts to work with conftool 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/412673 [09:32:16] 10Operations, 10netops: rhenium running out of disk space on / - https://phabricator.wikimedia.org/T187688#3982552 (10dkg) hm, it might be nice to have access to that space in `/srv`, but i don't think it's necessary right now. it looks like some extra space was already freed up earlier, and i've freed up mo... [09:35:39] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/412673 (owner: 10Giuseppe Lavagetto) [09:35:49] 10Operations, 10netops: rhenium running out of disk space on / - https://phabricator.wikimedia.org/T187688#3982552 (10elukey) I would spend a bit of time trying to move /var/lib/postresql to /srv to avoid the recurrence of this issue, a root partition so small is not meant to keep database data in my opinion :) [09:38:10] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: New WDQS clusters eqiad + codfw - https://phabricator.wikimedia.org/T182991#3984917 (10Gehel) [09:38:50] PROBLEM - DPKG on puppetmaster2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:39:50] RECOVERY - DPKG on puppetmaster2001 is OK: All packages OK [09:42:00] (03PS4) 10Filippo Giunchedi: prometheus: tweak node_exporter ignored_devices and ignored_fs_types [puppet] - 10https://gerrit.wikimedia.org/r/404430 (https://phabricator.wikimedia.org/T184469) [09:42:09] (03CR) 10Hashar: "recheck" [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/412813 (owner: 10Volans) [09:43:13] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412865 [09:44:31] (03CR) 10Giuseppe Lavagetto: [C: 032] 
conftool::scripts: update scripts to work with conftool 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/412673 (owner: 10Giuseppe Lavagetto) [09:46:05] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412865 (owner: 10Marostegui) [09:47:42] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412865 (owner: 10Marostegui) [09:47:49] (03CR) 10Filippo Giunchedi: "Tested on deployment-prep below, PCC https://puppet-compiler.wmflabs.org/compiler02/10030/" [puppet] - 10https://gerrit.wikimedia.org/r/404430 (https://phabricator.wikimedia.org/T184469) (owner: 10Filippo Giunchedi) [09:47:52] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412865 (owner: 10Marostegui) [09:48:53] (03PS5) 10Filippo Giunchedi: prometheus: tweak node_exporter ignored_devices and ignored_fs_types [puppet] - 10https://gerrit.wikimedia.org/r/404430 (https://phabricator.wikimedia.org/T184469) [09:49:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1088 after alter table (duration: 00m 55s) [09:49:15] !log Deploy schema change on s6 primary master db1061 - T185128 T153182 [09:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:30] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [09:49:30] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [09:56:22] (03PS1) 10Giuseppe Lavagetto: profile::conftool::client: also install pysocks [puppet] - 10https://gerrit.wikimedia.org/r/412867 [09:57:47] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::conftool::client: also install pysocks [puppet] - 
10https://gerrit.wikimedia.org/r/412867 (owner: 10Giuseppe Lavagetto) [10:00:00] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: tweak node_exporter ignored_devices and ignored_fs_types [puppet] - 10https://gerrit.wikimedia.org/r/404430 (https://phabricator.wikimedia.org/T184469) (owner: 10Filippo Giunchedi) [10:00:58] (03PS6) 10Filippo Giunchedi: prometheus: tweak node_exporter ignored_devices and ignored_fs_types [puppet] - 10https://gerrit.wikimedia.org/r/404430 (https://phabricator.wikimedia.org/T184469) [10:02:01] PROBLEM - puppet last run on wtp1026 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:03:41] PROBLEM - puppet last run on ores1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:04:01] PROBLEM - puppet last run on wtp1040 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:04:20] PROBLEM - puppet last run on ores2006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:04:21] PROBLEM - puppet last run on ms-fe2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:04:30] PROBLEM - puppet last run on wtp1046 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:05:21] PROBLEM - puppet last run on ms-fe1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[python-pysocks] [10:05:37] there's a problem with the package name on stretch [10:05:40] PROBLEM - puppet last run on ms-fe1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:05:51] PROBLEM - puppet last run on ores2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:06:13] _joe_: ^^^ [10:06:17] it's a virtual package, and on stretch the package name is python-socks [10:06:33] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2030 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412868 (https://phabricator.wikimedia.org/T187768) [10:06:50] PROBLEM - puppet last run on ores2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:07:01] <_joe_> wat? [10:07:10] PROBLEM - puppet last run on ms-fe2007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:07:14] <_joe_> the package is there on jessie, not on stretch I guess? [10:07:28] it's python-socks in stretch [10:07:53] <_joe_> yeah [10:07:55] <_joe_> just seen [10:07:57] <_joe_> sigh [10:08:00] PROBLEM - puppet last run on wtp1031 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:08:03] <_joe_> ok fixing it [10:08:10] PROBLEM - puppet last run on ores2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:08:21] PROBLEM - puppet last run on wtp1044 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Package[python-pysocks] [10:08:30] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:08:40] PROBLEM - puppet last run on wtp1038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:09:01] PROBLEM - puppet last run on wtp1042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:09:21] PROBLEM - puppet last run on ores1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:09:51] PROBLEM - puppet last run on wtp1029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:10:00] PROBLEM - puppet last run on wtp1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:11:21] PROBLEM - puppet last run on ores1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[python-pysocks] [10:11:34] (03PS1) 10Giuseppe Lavagetto: profile::conftool::client: fix pysocks package name on stretch [puppet] - 10https://gerrit.wikimedia.org/r/412869 [10:11:39] <_joe_> moritzm: ^^ [10:12:11] (03CR) 10jerkins-bot: [V: 04-1] profile::conftool::client: fix pysocks package name on stretch [puppet] - 10https://gerrit.wikimedia.org/r/412869 (owner: 10Giuseppe Lavagetto) [10:13:12] <_joe_> heh something's wrong there [10:13:17] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Remove db2030 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412868 (https://phabricator.wikimedia.org/T187768) (owner: 10Marostegui) [10:13:31] !log unified python-requests-mock packages in apt.wikimedia.org jessie-wikimedia to be 1.3.0-3~wmf1, removed binaries for 1.3.0-3 [10:13:37] looking [10:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:06] (03CR) 10Muehlenhoff: [C: 031] profile::conftool::client: fix pysocks package name on stretch [puppet] - 10https://gerrit.wikimedia.org/r/412869 (owner: 10Giuseppe Lavagetto) [10:14:10] PROBLEM - puppet last run on wtp1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:14:20] PROBLEM - puppet last run on ores2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:14:27] <_joe_> moritzm: there is a syntax error there apparently [10:14:30] PROBLEM - puppet last run on ms-fe1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:14:30] PROBLEM - puppet last run on wtp1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[python-pysocks] [10:14:41] (03PS2) 10Urbanecm: Add suppressredirect to autoconfirmed at zhwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410475 (https://phabricator.wikimedia.org/T187018) [10:14:51] PROBLEM - puppet last run on wtp1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:14:51] PROBLEM - puppet last run on wtp1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:14:52] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2030 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412868 (https://phabricator.wikimedia.org/T187768) (owner: 10Marostegui) [10:15:00] PROBLEM - puppet last run on wtp1036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:15:20] (03PS2) 10Giuseppe Lavagetto: profile::conftool::client: fix pysocks package name on stretch [puppet] - 10https://gerrit.wikimedia.org/r/412869 [10:15:40] (03Abandoned) 10Urbanecm: Restrict merging rights to autoconfirmed users on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374328 (https://phabricator.wikimedia.org/T174345) (owner: 10Urbanecm) [10:15:50] PROBLEM - puppet last run on acrab is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:15:51] PROBLEM - puppet last run on ores2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[python-pysocks] [10:15:58] (03PS3) 10Giuseppe Lavagetto: profile::conftool::client: fix pysocks package name on stretch [puppet] - 10https://gerrit.wikimedia.org/r/412869 [10:16:15] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::conftool::client: fix pysocks package name on stretch [puppet] - 10https://gerrit.wikimedia.org/r/412869 (owner: 10Giuseppe Lavagetto) [10:16:22] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::conftool::client: fix pysocks package name on stretch [puppet] - 10https://gerrit.wikimedia.org/r/412869 (owner: 10Giuseppe Lavagetto) [10:16:27] (03Abandoned) 10Urbanecm: Switch Wikipedias from $wgLogoHD to direct using of a SVG [mediawiki-config] - 10https://gerrit.wikimedia.org/r/401523 (https://phabricator.wikimedia.org/T178942) (owner: 10Urbanecm) [10:16:51] PROBLEM - puppet last run on ores2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:17:05] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2030 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412868 (https://phabricator.wikimedia.org/T187768) (owner: 10Marostegui) [10:17:10] PROBLEM - puppet last run on ores1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:17:50] PROBLEM - puppet last run on ores1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:17:51] PROBLEM - puppet last run on ms-fe1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[python-pysocks] [10:18:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db2030 from config - T187768 (duration: 00m 56s) [10:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:30] T187768: Decommission db2030 - https://phabricator.wikimedia.org/T187768 [10:19:00] PROBLEM - puppet last run on chlorine is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:19:01] PROBLEM - puppet last run on wtp1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:19:11] <_joe_> ok the issue is fixed AFAICS [10:19:50] PROBLEM - puppet last run on wtp1033 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:20:25] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db2030 from config - T187768 (duration: 00m 55s) [10:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:50] PROBLEM - puppet last run on ms-fe2006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:22:30] PROBLEM - puppet last run on wtp1043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-pysocks] [10:23:00] PROBLEM - puppet last run on ores1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[python-pysocks] [10:23:20] RECOVERY - puppet last run on wtp1044 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:24:21] RECOVERY - puppet last run on ms-fe2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:28:49] 10Operations, 10monitoring, 10Patch-For-Review, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Effects on adjusting Prometheus retention - https://phabricator.wikimedia.org/T160677#3985012 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi I bumped the minimum retention period to six months for a... [10:32:10] RECOVERY - puppet last run on wtp1026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:32:10] RECOVERY - puppet last run on ms-fe2007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:33:30] RECOVERY - puppet last run on argon is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:33:40] RECOVERY - puppet last run on ores1008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:34:01] RECOVERY - puppet last run on wtp1040 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:34:01] RECOVERY - puppet last run on wtp1042 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:34:19] (03PS1) 10Giuseppe Lavagetto: Add a couple bugfixes to the new features: [software/conftool] - 10https://gerrit.wikimedia.org/r/412870 [10:34:20] RECOVERY - puppet last run on ores2006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:34:21] RECOVERY - puppet last run on ores1004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:34:30] RECOVERY - puppet last run on wtp1046 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:35:21] RECOVERY - puppet last run on ms-fe1008 is OK: OK: 
Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:35:40] RECOVERY - puppet last run on ms-fe1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:35:48] (03CR) 10jerkins-bot: [V: 04-1] Add a couple bugfixes to the new features: [software/conftool] - 10https://gerrit.wikimedia.org/r/412870 (owner: 10Giuseppe Lavagetto) [10:35:51] RECOVERY - puppet last run on ores2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:36:50] RECOVERY - puppet last run on ores2008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:37:26] (03CR) 10Giuseppe Lavagetto: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/412870 (owner: 10Giuseppe Lavagetto) [10:37:56] (03PS3) 10KartikMistry: Deploy Compact Language Links out of Beta on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412696 (https://phabricator.wikimedia.org/T187677) [10:38:00] RECOVERY - puppet last run on wtp1031 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:38:10] RECOVERY - puppet last run on ores2009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:38:40] RECOVERY - puppet last run on wtp1038 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:38:54] (03CR) 10jerkins-bot: [V: 04-1] Add a couple bugfixes to the new features: [software/conftool] - 10https://gerrit.wikimedia.org/r/412870 (owner: 10Giuseppe Lavagetto) [10:39:20] RECOVERY - puppet last run on ores2003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [10:39:30] RECOVERY - puppet last run on ms-fe1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:39:30] RECOVERY - puppet last run on wtp1032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:39:50] RECOVERY - puppet last run on wtp1029 is OK: OK: Puppet is 
currently enabled, last run 4 minutes ago with 0 failures [10:39:50] RECOVERY - puppet last run on wtp1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:39:50] RECOVERY - puppet last run on wtp1028 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:40:00] RECOVERY - puppet last run on wtp1035 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:40:50] RECOVERY - puppet last run on acrab is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:40:51] RECOVERY - puppet last run on ores2005 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [10:41:21] RECOVERY - puppet last run on ores1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:41:51] RECOVERY - puppet last run on ores2001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [10:42:06] <_joe_> + bailout 1 'Error: Failed to create cowbuilder base /srv/pbuilder/base-jesse-amd64.cow/.' [10:42:21] <_joe_> hashar: can we remove the debian-glue job from conftool? 
[10:42:29] <_joe_> it fails more often than not [10:42:50] RECOVERY - puppet last run on ores1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:43:00] RECOVERY - puppet last run on ms-fe1005 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [10:43:16] (03PS1) 10Jcrespo: mariadb: Depool db2037 and db2044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412871 (https://phabricator.wikimedia.org/T187722) [10:43:49] (03CR) 10Marostegui: [C: 031] mariadb: Depool db2037 and db2044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412871 (https://phabricator.wikimedia.org/T187722) (owner: 10Jcrespo) [10:44:10] RECOVERY - puppet last run on wtp1030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:44:45] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] "damn debian-glue nonsense." [software/conftool] - 10https://gerrit.wikimedia.org/r/412870 (owner: 10Giuseppe Lavagetto) [10:45:00] RECOVERY - puppet last run on wtp1036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:37] (03PS1) 10Jcrespo: mariadb: Remove db2037 and db2044 for mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412872 (https://phabricator.wikimedia.org/T187722) [10:45:50] RECOVERY - puppet last run on ms-fe2006 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [10:46:12] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2037 and db2044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412871 (https://phabricator.wikimedia.org/T187722) (owner: 10Jcrespo) [10:47:11] RECOVERY - puppet last run on ores1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:47:12] (03CR) 10jenkins-bot: mariadb: Depool db2037 and db2044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412871 (https://phabricator.wikimedia.org/T187722) (owner: 10Jcrespo) [10:47:22] (03CR) 10Marostegui: "We kept some hosts on the 
array list even if they were on misc." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412872 (https://phabricator.wikimedia.org/T187722) (owner: 10Jcrespo) [10:47:30] RECOVERY - puppet last run on wtp1043 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [10:48:00] RECOVERY - puppet last run on ores1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:49:00] RECOVERY - puppet last run on chlorine is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:49:01] RECOVERY - puppet last run on wtp1048 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:49:46] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2037 and db2044 (duration: 00m 55s) [10:49:50] RECOVERY - puppet last run on wtp1033 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:58] 10Operations, 10Traffic: Extra RTT on TLS handshakes - https://phabricator.wikimedia.org/T150561#2789604 (10Vgutierrez) For future references, nginx now (since 1.13.1) workarounds this issue setting TCP_NODELAY before doing the handshake: https://trac.nginx.org/nginx/ticket/413#comment:8 OpenSSL removed acces... [10:52:01] 10Operations, 10ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#3985051 (10akosiaris) Extended testing during 2 days on a VM without DRBD backing store on ganeti1005 caused no issues. Moving on to testing a DRBD disk directly on the host [10:53:50] (03CR) 10Jcrespo: [C: 032] "I think we should remove them, if they are kept but are not referenced anywhere else, that is a "bug" to me, even if nothing is broken. 
If" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412872 (https://phabricator.wikimedia.org/T187722) (owner: 10Jcrespo) [10:55:22] (03Merged) 10jenkins-bot: mariadb: Remove db2037 and db2044 for mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412872 (https://phabricator.wikimedia.org/T187722) (owner: 10Jcrespo) [10:57:07] (03CR) 10jenkins-bot: mariadb: Remove db2037 and db2044 for mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412872 (https://phabricator.wikimedia.org/T187722) (owner: 10Jcrespo) [10:58:54] !log jynus@tin Synchronized wmf-config/db-codfw.php: Remove db2037 and db2044 (duration: 00m 53s) [10:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:06] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Remove db2037 and db2044 (duration: 00m 55s) [11:00:07] !log Deploy schema change on s2 codfw master (db2035) with replication, this will generate lag on codfw - T187089 T185128 T153182 [11:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:20] (03PS1) 10Gehel: wdqs: allow configuration of kafka based updates [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) [11:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:32] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [11:00:32] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [11:00:32] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [11:01:00] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3985083 (10jcrespo) Aside from m5, what is the other host for? x1 or m2? 
[11:01:43] (03CR) 10Gehel: [C: 04-1] "The related wdqs updater startup script needs to be merged first." [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: 10Gehel) [11:01:56] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3985086 (10Marostegui) So I think db2037 -> m5 db2044 -> x1/m2 no? [11:02:15] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3985088 (10Marostegui) m2, sorry [11:09:08] !log Deploy schema change on labtestweb2001 - T153182 T185128 T187089 [11:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:23] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [11:09:23] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [11:09:23] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [11:12:45] (03PS1) 10Giuseppe Lavagetto: Fix parameters injection in ToolCliSimpleAction [software/conftool] - 10https://gerrit.wikimedia.org/r/412875 [11:14:10] (03CR) 10jerkins-bot: [V: 04-1] Fix parameters injection in ToolCliSimpleAction [software/conftool] - 10https://gerrit.wikimedia.org/r/412875 (owner: 10Giuseppe Lavagetto) [11:14:49] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Fix parameters injection in ToolCliSimpleAction [software/conftool] - 10https://gerrit.wikimedia.org/r/412875 (owner: 10Giuseppe Lavagetto) [11:15:17] (03PS1) 10Marostegui: install_server: Allow db2037 to reinstall [puppet] - 10https://gerrit.wikimedia.org/r/412876 (https://phabricator.wikimedia.org/T187722) [11:16:06] (03CR) 10Jcrespo: [C: 031] install_server: 
Allow db2037 to reinstall [puppet] - 10https://gerrit.wikimedia.org/r/412876 (https://phabricator.wikimedia.org/T187722) (owner: 10Marostegui) [11:16:12] (03CR) 10Filippo Giunchedi: elasticsearch: collect elasticsearch metrics on per node percentiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412670 (owner: 10Gehel) [11:16:19] (03CR) 10Marostegui: [C: 032] install_server: Allow db2037 to reinstall [puppet] - 10https://gerrit.wikimedia.org/r/412876 (https://phabricator.wikimedia.org/T187722) (owner: 10Marostegui) [11:20:42] (03PS1) 10Filippo Giunchedi: hieradata: add swift private containers secret [labs/private] - 10https://gerrit.wikimedia.org/r/412877 [11:21:43] (03PS1) 10Marostegui: db2037: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/412878 (https://phabricator.wikimedia.org/T187722) [11:22:35] (03CR) 10Marostegui: [C: 032] db2037: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/412878 (https://phabricator.wikimedia.org/T187722) (owner: 10Marostegui) [11:23:04] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: add swift private containers secret [labs/private] - 10https://gerrit.wikimedia.org/r/412877 (owner: 10Filippo Giunchedi) [11:23:17] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] hieradata: add swift private containers secret [labs/private] - 10https://gerrit.wikimedia.org/r/412877 (owner: 10Filippo Giunchedi) [11:23:49] (03PS2) 10Volans: Migrate the server side to Python3 [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/405879 [11:24:10] !log upgrading mariadb-client on neodymium and sarin [11:24:20] PROBLEM - Apache HTTP on mw2206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:48] (03CR) 10Volans: "done" (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/405879 (owner: 10Volans) [11:25:11] RECOVERY - Apache HTTP on mw2206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in
0.121 second response time [11:35:26] Hi, I am getting this puppet error. [11:35:27] The following packages have unmet dependencies: [11:35:27] python-conftool : Depends: python-jsonschema but it is not going to be installed [11:35:36] (on a stretch instance) [11:36:05] full error https://phabricator.wikimedia.org/P6719 [11:36:30] (03PS1) 10Elukey: role::prometheus::analytics|ops: add Kafka Burrow jobs [puppet] - 10https://gerrit.wikimedia.org/r/412881 (https://phabricator.wikimedia.org/T180442) [11:37:36] trying to install python-jsonschema manually shows [11:37:38] python-jsonschema : Depends: python-mock but it is not going to be installed [11:37:52] python-mock : Depends: python-pbr (>= 1.3) but 0.8.2-1 is to be installed [11:37:59] oh [11:38:11] that is going to break zuul [11:39:16] <_joe_> paladox: uhm, that is strange. it built correctly [11:39:30] _joe_ it seems that it wants python-pbr 1.3+ [11:39:41] but I have to have python-pbr 0.8.2 installed for zuul [11:39:41] <_joe_> paladox: that's strange, tbh [11:39:48] <_joe_> oh I see [11:39:53] <_joe_> so that's the issue [11:39:58] otherwise pbr will break for zuul [11:40:32] <_joe_> ok, then I guess we have to keep zuul on jessie until we have a newer version?
[11:40:36] (I'm using the ci puppet class) [11:40:44] (03CR) 10Filippo Giunchedi: Give officewiki read access to Thumbor (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [11:40:59] <_joe_> yeah, that's a problem that's specific to zuul though, I hope [11:41:19] <_joe_> lemme verify it [11:41:23] ok [11:42:02] there's this task https://phabricator.wikimedia.org/T162787 [11:42:43] <_joe_> paladox: gimme 5 minutes [11:42:48] ok [11:42:52] thanks [11:45:19] (03CR) 10Filippo Giunchedi: [C: 031] "> Patch Set 6: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: 10Eevans) [11:47:32] <_joe_> paladox: yes it installs flawlessly on a stretch host [11:47:39] ok thanks. [11:47:43] <_joe_> so it's something with zuul, I guess [11:47:55] _joe_ yeh it's the version number [11:47:57] <_joe_> the production zuul is on jessie, correct? [11:48:01] pbr 1.x is a lot stricter [11:48:03] and yes i think so [11:48:27] <_joe_> win 19 [11:49:01] <_joe_> paladox: ok then, I guess for stretch we'll need a newer zuul, or to install it not via a debian package [11:50:13] _joe_ i am going to try and fix it in https://gerrit.wikimedia.org/r/#/c/412884/ :) [11:51:09] <_joe_> ok!
I can go on and upgrade python-conftool in production then [11:53:06] (03PS1) 10Muehlenhoff: Use security mirrors in cowbuilder apt config [puppet] - 10https://gerrit.wikimedia.org/r/412886 [11:53:17] (03CR) 10Elukey: "Tried to use pcc for prometheus1003 but it takes ages to complete so I've aborted it: https://integration.wikimedia.org/ci/job/operations-" [puppet] - 10https://gerrit.wikimedia.org/r/412881 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [11:53:33] (03CR) 10jerkins-bot: [V: 04-1] Use security mirrors in cowbuilder apt config [puppet] - 10https://gerrit.wikimedia.org/r/412886 (owner: 10Muehlenhoff) [11:56:46] (03PS2) 10Muehlenhoff: Use security mirrors in cowbuilder apt config [puppet] - 10https://gerrit.wikimedia.org/r/412886 [11:56:56] (03PS3) 10Gilles: Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) [11:57:27] (03CR) 10jerkins-bot: [V: 04-1] Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [11:59:09] (03PS4) 10Gilles: Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) [11:59:21] (03CR) 10Gilles: Give officewiki read access to Thumbor (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [11:59:27] (03CR) 10Elukey: [C: 031] cassandra: create parent data directories with exec [puppet] - 10https://gerrit.wikimedia.org/r/404705 (https://phabricator.wikimedia.org/T175284) (owner: 10Eevans) [11:59:33] (03PS5) 10Gilles: Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) [12:00:09] (03CR) 10jerkins-bot: [V: 04-1] Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/407608 
(https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [12:11:50] <_joe_> !log upgrading conftool to 1.0.0~beta2 on scb* [12:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:40] (03PS3) 10Muehlenhoff: Use security mirrors in cowbuilder apt config [puppet] - 10https://gerrit.wikimedia.org/r/412886 [12:16:08] (03CR) 10jerkins-bot: [V: 04-1] Use security mirrors in cowbuilder apt config [puppet] - 10https://gerrit.wikimedia.org/r/412886 (owner: 10Muehlenhoff) [12:18:28] (03PS4) 10Muehlenhoff: Use security mirrors in cowbuilder apt config [puppet] - 10https://gerrit.wikimedia.org/r/412886 [12:38:09] 10Operations, 10Traffic, 10Performance, 10Performance-Team (Radar): missing H2 coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#3985229 (10BBlack) We actually do use the same cert for both, so we don't need the secondary certs bit. Remaining b... [12:40:51] (03Abandoned) 10Ema: Attempt running puppet again in case of failure [puppet] - 10https://gerrit.wikimedia.org/r/298921 (owner: 10Ema) [12:42:49] (03PS1) 10Volans: Cumin masters: upgrade to python3 [puppet] - 10https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773) [12:43:21] (03CR) 10jerkins-bot: [V: 04-1] Cumin masters: upgrade to python3 [puppet] - 10https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773) (owner: 10Volans) [12:44:04] lovely missing support of py3 :( [12:44:23] * volans should find the time to work on T184435 [12:44:44] 10Operations, 10netops: rhenium running out of disk space on / - https://phabricator.wikimedia.org/T187688#3982552 (10akosiaris) What exactly is postgresql doing on that machine without it being puppetized ? There's a very strict rule against this and is very clearly spelled out in L3. 
[12:46:12] let's try to trick it [12:46:17] (03PS2) 10Volans: Cumin masters: upgrade to python3 [puppet] - 10https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773) [12:49:44] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3985257 (10mmodell) I posted this upstream: https://dis... [12:59:50] 10Operations, 10netops: rhenium running out of disk space on / - https://phabricator.wikimedia.org/T187688#3985286 (10faidon) 05Open>03Resolved a:03faidon I've deleted a 7.7G file and freed up some space. As for Postgres, it's for a temporary situation for a bit of a high-priority and unusual situation,... [13:03:51] !log installing libav security updates [13:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:04] (03CR) 10Rush: [C: 04-1] "Let's talk about this in our meeting. We have had to purge require_package() from the majority of the openstack provider code because it " [puppet] - 10https://gerrit.wikimedia.org/r/412699 (owner: 10Arturo Borrero Gonzalez) [13:09:22] PROBLEM - HHVM rendering on mw2216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:09:51] (03PS1) 10Urbanecm: Add romd.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/412896 (https://phabricator.wikimedia.org/T187184) [13:10:13] RECOVERY - HHVM rendering on mw2216 is OK: HTTP OK: HTTP/1.1 200 OK - 74550 bytes in 0.289 second response time [13:13:17] (03PS1) 10Urbanecm: Add romd.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/412898 [13:14:03] (03CR) 10Arturo Borrero Gonzalez: "According to @_joe_, puppet4 now takes into account declaration ordering. 
I'm not sure if this could be of any help in case statements are" [puppet] - 10https://gerrit.wikimedia.org/r/412699 (owner: 10Arturo Borrero Gonzalez) [13:14:39] (03PS2) 10Urbanecm: Add romd.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/412898 [13:14:52] (03PS3) 10Urbanecm: Add romd.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/412898 (https://phabricator.wikimedia.org/T187184) [13:16:34] (03CR) 10Alexandros Kosiaris: [C: 04-1] "As is it will not work. See inline comments and PCC output at https://puppet-compiler.wmflabs.org/compiler02/10039/boron.eqiad.wmnet/" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/412886 (owner: 10Muehlenhoff) [13:16:50] (03PS4) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412672 (https://phabricator.wikimedia.org/T187171) [13:17:56] !log Stop MySQL and reboot db1112 for kernel and mariadb upgrade [13:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:22] (03PS3) 10Matthias Mullie: Load 3D extension on other wikis, for display only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) [13:18:32] (03CR) 10Rush: "Two packages I would like to see pinned to current versions in Toolforge: nginx and kubernetes* (common, client, etc)" [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [13:19:53] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, there's a port change needed before this" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/412881 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [13:22:47] (03CR) 10Elukey: role::prometheus::analytics|ops: add Kafka Burrow jobs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/412881 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [13:23:45] !log Stop MySQL and reboot db1111 for kernel and mariadb upgrade [13:23:54] 
(03PS2) 10Elukey: role::prometheus::analytics|ops: add Kafka Burrow jobs [puppet] - 10https://gerrit.wikimedia.org/r/412881 (https://phabricator.wikimedia.org/T180442) [13:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:02] (03CR) 10Rush: [C: 04-1] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/412699 (owner: 10Arturo Borrero Gonzalez) [13:25:21] 10Operations, 10Discovery, 10Epic, 10Maps (Maps-data): Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616#3985367 (10Gehel) [13:27:01] (03PS5) 10Muehlenhoff: Use security mirrors in cowbuilder apt config [puppet] - 10https://gerrit.wikimedia.org/r/412886 [13:29:04] !log Upgrade kernel and reboot db1113 and db1114 [13:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:46] (03PS1) 10Urbanecm: Initial configuration for romdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) [13:31:05] (03CR) 10Arturo Borrero Gonzalez: "> I don't think this has the effect we would intend. 
P4 afaiu does" [puppet] - 10https://gerrit.wikimedia.org/r/412699 (owner: 10Arturo Borrero Gonzalez) [13:32:21] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for romdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [13:33:43] 10Operations, 10Traffic: varnish: discard cold vcl - https://phabricator.wikimedia.org/T187778#3985419 (10ema) [13:33:57] 10Operations, 10Traffic: varnish: discard cold vcl - https://phabricator.wikimedia.org/T187778#3985430 (10ema) p:05Triage>03Low [13:34:09] (03CR) 10Muehlenhoff: Use security mirrors in cowbuilder apt config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412886 (owner: 10Muehlenhoff) [13:34:53] (03PS2) 10Rush: openstack: neutron l3 and service for labtestn [puppet] - 10https://gerrit.wikimedia.org/r/411488 (https://phabricator.wikimedia.org/T167293) [13:36:03] (03CR) 10Rush: [C: 032] openstack: neutron l3 and service for labtestn [puppet] - 10https://gerrit.wikimedia.org/r/411488 (https://phabricator.wikimedia.org/T167293) (owner: 10Rush) [13:39:17] (03CR) 10Gehel: elasticsearch: collect elasticsearch metrics on per node percentiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412670 (owner: 10Gehel) [13:39:50] 10Operations, 10Mail, 10monitoring: tls expiry check for mx vs acme-setup renewal period - https://phabricator.wikimedia.org/T181519#3985473 (10fgiunchedi) [13:41:26] (03PS8) 10Ema: pybal: check established TCP connections to etcd [puppet] - 10https://gerrit.wikimedia.org/r/409922 (https://phabricator.wikimedia.org/T170847) [13:42:26] (03CR) 10Ema: [C: 032] pybal: check established TCP connections to etcd [puppet] - 10https://gerrit.wikimedia.org/r/409922 (https://phabricator.wikimedia.org/T170847) (owner: 10Ema) [13:44:27] (03CR) 10Filippo Giunchedi: elasticsearch: collect elasticsearch metrics on per node percentiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412670 (owner: 10Gehel) 
[13:45:49] question for the collective - does +accountcreator allow the creation of accounts on a blocked IP address which has an anon only block and account creation disabled ? [13:46:13] (03PS1) 10Giuseppe Lavagetto: conftool::scripts: backwards compatibility [puppet] - 10https://gerrit.wikimedia.org/r/412906 [13:48:26] (03CR) 10Gehel: elasticsearch: collect elasticsearch metrics on per node percentiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412670 (owner: 10Gehel) [13:49:32] 10Operations, 10Traffic, 10Performance, 10Performance-Team (Radar): missing H2 coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#3985528 (10Gilles) > legacy HTTP/1 UAs may suffer due to UA limits I believe that the connection limit UAs have fo... [13:49:40] (03CR) 10Volans: "LGTM, minor note inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412906 (owner: 10Giuseppe Lavagetto) [13:54:46] (03PS6) 10Muehlenhoff: Use security mirrors in cowbuilder apt config [puppet] - 10https://gerrit.wikimedia.org/r/412886 [13:57:05] (03CR) 10Giuseppe Lavagetto: conftool::scripts: backwards compatibility (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412906 (owner: 10Giuseppe Lavagetto) [13:57:53] jouncebot, next [13:57:54] In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180220T1400) [13:58:59] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412606 (https://phabricator.wikimedia.org/T187655) (owner: 10MarcoAurelio) [13:59:42] (03CR) 10Filippo Giunchedi: Use security mirrors in cowbuilder apt config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/412886 (owner: 10Muehlenhoff) [14:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy European Mid-day SWAT(Max 8 
patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180220T1400). Please do the needful. [14:00:05] Hauskatze and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:07] Urbanecm: your commit https://gerrit.wikimedia.org/r/#/c/412672/ conflicts with https://gerrit.wikimedia.org/r/#/c/412606/ [14:00:12] I can SWAT today [14:00:29] Hauskatze: around for SWAT? [14:00:40] zeljkof, he said he won't be around [14:00:42] But please deploy [14:00:56] (03Merged) 10jenkins-bot: throttle: add new rule for Wikidata edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412606 (https://phabricator.wikimedia.org/T187655) (owner: 10MarcoAurelio) [14:00:57] After deployment of MarcoAurelio's (Hauskatze's) patch, I'll solve the conflict [14:02:29] (03PS2) 10Giuseppe Lavagetto: conftool::scripts: backwards compatibility [puppet] - 10https://gerrit.wikimedia.org/r/412906 [14:03:21] (03CR) 10jenkins-bot: throttle: add new rule for Wikidata edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412606 (https://phabricator.wikimedia.org/T187655) (owner: 10MarcoAurelio) [14:03:29] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool::scripts: backwards compatibility [puppet] - 10https://gerrit.wikimedia.org/r/412906 (owner: 10Giuseppe Lavagetto) [14:03:37] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:412606|throttle: add new rule for Wikidata edit-a-thon (T187655)]] (duration: 00m 56s) [14:03:46] Urbanecm: you are next :) [14:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:52] (03PS5) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412672 (https://phabricator.wikimedia.org/T187171) [14:03:52] T187655: IP unblock requested for 20-30 new accounts being created at University of Edinburgh 
event. - https://phabricator.wikimedia.org/T187655 [14:03:57] zeljkof, conflict resolved [14:04:01] gerrit now does not complain about 412672 [14:04:08] oh, cool, thanks, reviewing [14:04:18] yw [14:04:35] Urbanecm: um, you have added a few logos to 412672 [14:04:54] zeljkof, how... Will delete :) [14:05:28] :) [14:05:38] (03PS6) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412672 (https://phabricator.wikimedia.org/T187171) [14:05:44] zeljkof, they're gone ;) [14:05:48] hello [14:05:51] Jayprakash12345, hi [14:06:05] Jayprakash12345: hi [14:06:12] Urbanecm: reviewing [14:06:21] zeljkof, and let's add the logos to 412902 where they should be :D [14:06:52] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412672 (https://phabricator.wikimedia.org/T187171) (owner: 10Urbanecm) [14:06:55] (03PS2) 10Urbanecm: Initial configuration for romdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) [14:07:44] (03PS3) 10Urbanecm: Initial configuration for romdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) [14:08:25] (03Merged) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412672 (https://phabricator.wikimedia.org/T187171) (owner: 10Urbanecm) [14:08:49] I submitted one patch for deployment, is all ok?
[14:09:16] Jayprakash12345: sure, you are in line after Urbanecm [14:09:27] (03CR) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412672 (https://phabricator.wikimedia.org/T187171) (owner: 10Urbanecm) [14:10:57] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:412672|New throttle rule (T187171)]] (duration: 00m 55s) [14:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:10] T187171: Temporary lift of IP cap account creation for en wiki on March 8th from 10am UTC to 2pm UTC for editathon - https://phabricator.wikimedia.org/T187171 [14:11:40] Urbanecm: 412672 deployed [14:11:46] zeljkof, ack [14:11:51] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410475 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:13:20] (03CR) 10Filippo Giunchedi: role::prometheus::analytics|ops: add Kafka Burrow jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412881 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [14:13:22] (03Merged) 10jenkins-bot: Add suppressredirect to autoconfirmed at zhwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410475 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:13:30] (03CR) 10jenkins-bot: Add suppressredirect to autoconfirmed at zhwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410475 (https://phabricator.wikimedia.org/T187018) (owner: 10Urbanecm) [14:13:48] (03PS7) 10Muehlenhoff: Use security mirrors in cowbuilder apt config [puppet] - 10https://gerrit.wikimedia.org/r/412886 [14:14:13] Urbanecm: 410475 is at mwdebug1002 [14:14:23] zeljkof, going to test [14:14:43] (03PS5) 10Urbanecm: Add Draft namespace to hiwikiversity. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/412081 (https://phabricator.wikimedia.org/T187535) (owner: 10Tulsi Bhagat) [14:15:04] (03PS3) 10Elukey: role::prometheus::analytics|ops: add Kafka Burrow jobs [puppet] - 10https://gerrit.wikimedia.org/r/412881 (https://phabricator.wikimedia.org/T180442) [14:15:10] (03CR) 10Filippo Giunchedi: [C: 031] role::prometheus::analytics|ops: add Kafka Burrow jobs [puppet] - 10https://gerrit.wikimedia.org/r/412881 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [14:15:41] zeljkof, working, please deploy [14:15:44] (03CR) 10Elukey: [C: 032] role::prometheus::analytics|ops: add Kafka Burrow jobs [puppet] - 10https://gerrit.wikimedia.org/r/412881 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [14:15:46] deploying [14:16:01] ack [14:16:49] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:410475|Add suppressredirect to autoconfirmed at zhwikt (T187018)]] (duration: 00m 55s) [14:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:02] T187018: Request to change autoconfirmed settings, allow autoconfirmed user to suppress redirects and allow sysop to grant and remove flood flags on zh.wiktionary - https://phabricator.wikimedia.org/T187018 [14:17:07] Urbanecm: deployed, please test and thanks for deploying with #releng ;) [14:17:22] Yw :) [14:17:44] Jayprakash12345: you are next, I am reviewing 412081 [14:18:01] zeljkof: Ok [14:18:04] I will let you know when it's ready for testing at mwdebug1002 [14:18:12] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 34.60, 33.72, 32.02 [14:18:19] zeljkof: :) [14:20:37] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412081 (https://phabricator.wikimedia.org/T187535) (owner: 10Tulsi Bhagat) [14:21:58] (03Merged) 10jenkins-bot: Add Draft namespace to hiwikiversity. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/412081 (https://phabricator.wikimedia.org/T187535) (owner: 10Tulsi Bhagat) [14:22:00] (03CR) 10jenkins-bot: Add Draft namespace to hiwikiversity. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412081 (https://phabricator.wikimedia.org/T187535) (owner: 10Tulsi Bhagat) [14:22:34] Jayprakash12345: 412081 is at mwdebug1002, please test and let me know if I can deploy [14:24:20] zeljkof: Tested, Please deploy [14:24:27] Jayprakash12345: deploying [14:24:44] zeljkof: Run script as well [14:24:54] sure [14:25:05] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler03/10048/thumbor1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [14:25:20] (03PS6) 10Filippo Giunchedi: Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [14:25:37] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:412081|Add Draft namespace to hiwikiversity. (T187535)]] (duration: 00m 56s) [14:25:47] Jayprakash12345: deployed, running script [14:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:53] T187535: Create Draft namespace on hiwikiversity - https://phabricator.wikimedia.org/T187535 [14:25:57] (03CR) 10jerkins-bot: [V: 04-1] Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [14:26:32] PROBLEM - puppet last run on bast4002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:26:44] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler02/10047/boron.eqiad.wmnet/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412886 (owner: 10Muehlenhoff) [14:27:42] Jayprakash12345, Urbanecm: do you know what to do? https://phabricator.wikimedia.org/T187535#3985650 [14:27:54] id=1473 ns=0 dbk=विकिविद्यालय:मुखपृष्ठ *** dest title exists and --add-prefix not specified [14:28:10] zeljkof, let me know, I'll have a look at the docs [14:28:20] also looking... [14:28:24] (03CR) 10Volans: [C: 031] "I didn't test it, but code looks ok to me." [puppet] - 10https://gerrit.wikimedia.org/r/409054 (https://phabricator.wikimedia.org/T181410) (owner: 10Filippo Giunchedi) [14:28:31] means there is no page for fix [14:28:38] *let me look... [14:28:39] As I know [14:28:45] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:28:50] Jayprakash12345, no, I don't think so [14:28:58] विकिविद्यालय:मुखपृष्ठ is the page [14:29:07] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] "Overriding jenkins' -1 due to hiera_array usage" [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [14:29:22] There is no draft page in hiwikiversity. I am there since its incub. [14:29:31] zeljkof, run mwscript namespaceDupes.php hiwikiversity --fix --add-prefix=T187535 [14:29:51] Jayprakash12345, what means "विकिविद्यालय"? [14:29:58] Urbanecm: running [14:30:08] Old Project Namespace. [14:30:28] Jayprakash12345, that means there's a problem with this particular namespace (somebody didn't run namespaceDupes.php) [14:30:35] zeljkof, thanks! 
[14:30:57] Jayprakash12345, Urbanecm: looks like it fixed the problem https://phabricator.wikimedia.org/T187535#3985654 [14:31:28] (03PS3) 10Zfilipin: Update the sitename of newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409557 (https://phabricator.wikimedia.org/T186952) (owner: 10Tulsi Bhagat) [14:31:30] That's great! [14:31:54] zeljkof: Looks good [14:32:05] Jayprakash12345, if you're a sysop, this https://hi.wikiversity.org/w/index.php?title=%E0%A4%B5%E0%A4%BF%E0%A4%95%E0%A4%BF%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%8D%E0%A4%B5%E0%A4%B5%E0%A4%BF%E0%A4%A6%E0%A5%8D%E0%A4%AF%E0%A4%BE%E0%A4%B2%E0%A4%AF:T187535%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0&redirect=no looks useless from my point of view [14:32:05] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:32:05] T187535: Create Draft namespace on hiwikiversity - https://phabricator.wikimedia.org/T187535 [14:33:05] Urbanecm: Yes, I am deleting this. [14:33:24] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409557 (https://phabricator.wikimedia.org/T186952) (owner: 10Tulsi Bhagat) [14:33:28] Jayprakash12345, thanks! [14:34:00] (03CR) 10Muehlenhoff: [C: 031] "Looks good!" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/405879 (owner: 10Volans) [14:34:05] Urbanecm: Thank you very much for help us. 
[14:34:13] Jayprakash12345, yw [14:34:24] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 8.87, 15.23, 23.88 [14:34:54] (03Merged) 10jenkins-bot: Update the sitename of newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409557 (https://phabricator.wikimedia.org/T186952) (owner: 10Tulsi Bhagat) [14:35:09] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 2 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3985661 (10mmodell) [14:36:09] Jayprakash12345: 409557 is at mwdebug1002, please test and let me know if I can deploy [14:36:35] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:37:03] (03CR) 10jenkins-bot: Update the sitename of newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/409557 (https://phabricator.wikimedia.org/T186952) (owner: 10Tulsi Bhagat) [14:37:49] zeljkof: Tested, Please deploy [14:37:57] Jayprakash12345: deploying [14:39:23] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:409557|Update the sitename of newiki (T186952)]] (duration: 00m 55s) [14:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:37] T186952: Update the sitename of newiki - https://phabricator.wikimedia.org/T186952 [14:39:54] Jayprakash12345: deployed, please check and thanks for deploying with #releng ;) [14:40:17] !log EU SWAT finished [14:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:53] zeljkof: Thanks, everything is ok.
[14:41:00] (03CR) 10MarcoAurelio: Initial configuration for romdwikimedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [14:41:14] (03CR) 10Filippo Giunchedi: [C: 031] Use security mirrors in cowbuilder apt config [puppet] - 10https://gerrit.wikimedia.org/r/412886 (owner: 10Muehlenhoff) [14:41:43] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10User-Elukey: Phabricator down due to "Failed to `proc_open()`: proc_open() expects parameter 2 to be array" - https://phabricator.wikimedia.org/T186620#3985688 (10mmodell) @elukey: I'm working on it. I'll create separate tasks for each actionable,... [14:44:14] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:44:25] <_joe_> godog: ^^ [14:44:32] (03PS6) 10Filippo Giunchedi: prometheus: add check prometheus metric script [puppet] - 10https://gerrit.wikimedia.org/r/409054 (https://phabricator.wikimedia.org/T181410) [14:44:41] wah wah [14:44:57] (03CR) 10Alexandros Kosiaris: [C: 032] "Makes sense that we would want to use the security origin by default. PCC also looks fine, I am merging this" [puppet] - 10https://gerrit.wikimedia.org/r/412886 (owner: 10Muehlenhoff) [14:45:03] (03PS8) 10Alexandros Kosiaris: Use security mirrors in cowbuilder apt config [puppet] - 10https://gerrit.wikimedia.org/r/412886 (owner: 10Muehlenhoff) [14:45:05] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Use security mirrors in cowbuilder apt config [puppet] - 10https://gerrit.wikimedia.org/r/412886 (owner: 10Muehlenhoff) [14:45:34] PROBLEM - Check systemd state on thumbor1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:45:58] <_joe_> godog: same on thumbor1003 [14:46:05] <_joe_> maybe revert/disable puppet? 
[14:46:14] PROBLEM - Check systemd state on thumbor2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:47:45] PROBLEM - Check systemd state on thumbor1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:48:10] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 304 bytes in 0.001 second response time [14:48:10] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([thumbor1002.eqiad.wmnet]) [14:48:21] <_joe_> godog: revert? [14:48:24] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers thumbor1002.eqiad.wmnet are marked down but pooled [14:48:24] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers thumbor1002.eqiad.wmnet are marked down but pooled [14:48:24] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:48:26] <_joe_> whatever you did before [14:48:37] * volans checking [14:48:38] sigh [14:48:39] yesh [14:48:44] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers thumbor1002.eqiad.wmnet are marked down but pooled [14:48:50] (03PS1) 10Giuseppe Lavagetto: Revert "Give officewiki read access to Thumbor" [puppet] - 10https://gerrit.wikimedia.org/r/412920 [14:49:03] <_joe_> godog: ^^ [14:49:17] (03CR) 10Filippo Giunchedi: [C: 031] Revert "Give officewiki read access to Thumbor" [puppet] - 10https://gerrit.wikimedia.org/r/412920 (owner: 10Giuseppe Lavagetto) [14:49:19] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 0.007 second response time [14:49:19] _joe_: thanks [14:49:20] I'm running puppet on thumbor1001 and so far it correctly starts all services [14:49:22] Parent received signal 15 [14:49:24] PROBLEM - Check systemd state on thumbor2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:49:24] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [14:49:30] with the puppet run [14:49:35] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([thumbor2002.codfw.wmnet]) [14:49:35] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "Give officewiki read access to Thumbor" [puppet] - 10https://gerrit.wikimedia.org/r/412920 (owner: 10Giuseppe Lavagetto) [14:49:38] (03PS2) 10Giuseppe Lavagetto: Revert "Give officewiki read access to Thumbor" [puppet] - 10https://gerrit.wikimedia.org/r/412920 [14:49:40] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Revert "Give officewiki read access to Thumbor" [puppet] - 10https://gerrit.wikimedia.org/r/412920 (owner: 10Giuseppe Lavagetto) [14:49:43] _joe_: probably no need to revert [14:49:44] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers thumbor2002.codfw.wmnet are marked down but pooled [14:49:45] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [14:49:52] <_joe_> oh heh [14:49:54] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers thumbor2002.codfw.wmnet are marked down but pooled [14:49:56] too late [14:49:58] :P [14:50:00] <_joe_> so should I wait? 
[14:50:05] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [14:50:06] <_joe_> akosiaris: it's still not puppet-merged [14:50:31] (03PS1) 10BBlack: eqsin: configure cache storage correctly [puppet] - 10https://gerrit.wikimedia.org/r/412921 (https://phabricator.wikimedia.org/T156027) [14:50:31] !log running puppet on thumbor1002 (was already logged in) [14:50:34] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [14:50:41] <_joe_> godog: indeed I restarted thumbor on 1003 and it seems to work [14:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:45] RECOVERY - Check systemd state on thumbor1003 is OK: OK - running: The system is fully operational [14:50:53] <_joe_> let's restart thumbor everywhere? [14:50:57] (03CR) 10jerkins-bot: [V: 04-1] eqsin: configure cache storage correctly [puppet] - 10https://gerrit.wikimedia.org/r/412921 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [14:51:05] yeah, same on 1001, it slowly starts all instances one by one [14:51:09] _joe_: or just run puppet [14:51:18] puppet run took 2.5 mins, though [14:51:24] godog: are you looking into it or should someone else take over? 
[14:51:32] <_joe_> do that with systemctl, in case [14:51:32] paravoid: I'm looking into it [14:51:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:51:35] k [14:51:47] (03PS2) 10BBlack: eqsin: configure cache storage correctly [puppet] - 10https://gerrit.wikimedia.org/r/412921 (https://phabricator.wikimedia.org/T156027) [14:51:54] _joe_: please puppet-merge the revert [14:51:54] RECOVERY - Check systemd state on thumbor1001 is OK: OK - running: The system is fully operational [14:52:02] <_joe_> godog: ok [14:52:10] I'll run puppet after that's done [14:52:24] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([thumbor2002.codfw.wmnet]) [14:52:38] <_joe_> godog: done [14:53:05] RECOVERY - PyBal IPVS diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal [14:53:09] kk [14:53:34] RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational [14:53:39] !log roll-restart thumbor after rollback [14:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:55] PROBLEM - Check systemd state on thumbor1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:55:02] <_joe_> godog: I guess you need to run puppet, right? [14:55:38] yes that's what's happening too [14:56:00] PROBLEM - LVS HTTP IPv4 on thumbor.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 304 bytes in 0.073 second response time [14:56:10] again? 
[14:56:34] PROBLEM - thumbor@8819 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8819 is failed [14:56:34] PROBLEM - thumbor@8829 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8829 is failed [14:56:34] PROBLEM - thumbor@8847 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8847 is failed [14:56:34] PROBLEM - thumbor@8833 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8833 is failed [14:56:35] PROBLEM - thumbor@8807 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8807 is failed [14:56:35] PROBLEM - thumbor@8848 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8848 is failed [14:56:35] PROBLEM - thumbor@8836 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8836 is failed [14:56:35] PROBLEM - thumbor@8815 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8815 is failed [14:56:35] PROBLEM - thumbor@8809 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8809 is failed [14:56:36] PROBLEM - thumbor@8834 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8834 is failed [14:56:36] PROBLEM - thumbor@8801 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8801 is failed [14:56:37] PROBLEM - thumbor@8813 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8813 is failed [14:56:38] that's codfw, rolling restart now there as well [14:56:47] <_joe_> ok [14:56:54] PROBLEM - thumbor@8826 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8826 is failed [14:56:54] PROBLEM - thumbor@8832 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8832 is failed [14:56:55] PROBLEM - thumbor@8840 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active 
but unit thumbor@8840 is failed [14:56:55] PROBLEM - thumbor@8825 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8825 is failed [14:56:55] PROBLEM - thumbor@8823 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8823 is failed [14:56:55] PROBLEM - thumbor@8845 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8845 is failed [14:56:55] PROBLEM - thumbor@8820 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8820 is failed [14:57:04] PROBLEM - thumbor@8804 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8804 is failed [14:57:07] PROBLEM - thumbor@8805 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8805 is failed [14:57:07] PROBLEM - thumbor@8828 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8828 is failed [14:57:14] PROBLEM - thumbor@8808 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8808 is failed [14:57:14] PROBLEM - thumbor@8821 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8821 is failed [14:57:14] PROBLEM - thumbor@8817 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8817 is failed [14:57:14] PROBLEM - thumbor@8802 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8802 is failed [14:57:14] PROBLEM - thumbor@8830 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8830 is failed [14:57:14] PROBLEM - thumbor@8822 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8822 is failed [14:57:14] PROBLEM - thumbor@8803 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8803 is failed [14:57:15] PROBLEM - thumbor@8835 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8835 is failed [14:57:15] 
PROBLEM - thumbor@8827 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8827 is failed [14:57:16] PROBLEM - thumbor@8818 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8818 is failed [14:57:16] PROBLEM - thumbor@8841 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8841 is failed [14:57:17] PROBLEM - thumbor@8824 service on thumbor2004 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8824 is failed [14:57:45] PROBLEM - thumbor@8801 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8801 is failed [14:57:45] PROBLEM - thumbor@8838 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8838 is failed [14:57:45] PROBLEM - thumbor@8836 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8836 is failed [14:57:45] PROBLEM - thumbor@8815 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8815 is failed [14:57:49] <_joe_> oh the joy [14:57:54] PROBLEM - thumbor@8828 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8828 is failed [14:57:54] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [14:57:54] PROBLEM - thumbor@8834 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8834 is deactivating [14:57:56] PROBLEM - thumbor@8823 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8823 is failed [14:57:56] PROBLEM - thumbor@8821 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8821 is failed [14:57:56] PROBLEM - thumbor@8840 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8840 is failed [14:57:56] PROBLEM - thumbor@8839 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8839 is failed [14:57:56] <_joe_> :P [14:57:56] PROBLEM - 
thumbor@8827 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8827 is failed [14:57:56] PROBLEM - thumbor@8819 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8819 is failed [14:57:56] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [14:58:04] RECOVERY - Check systemd state on thumbor1004 is OK: OK - running: The system is fully operational [14:58:04] PROBLEM - thumbor@8835 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8835 is failed [14:58:05] PROBLEM - thumbor@8806 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8806 is failed [14:58:05] PROBLEM - thumbor@8826 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8826 is failed [14:58:14] PROBLEM - thumbor@8807 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8807 is failed [14:58:15] PROBLEM - thumbor@8812 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8812 is failed [14:58:15] PROBLEM - thumbor@8817 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8817 is failed [14:58:15] PROBLEM - thumbor@8805 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8805 is failed [14:58:15] PROBLEM - thumbor@8825 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8825 is failed [14:58:15] PROBLEM - thumbor@8818 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8818 is failed [14:58:15] PROBLEM - thumbor@8824 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8824 is failed [14:58:16] PROBLEM - thumbor@8833 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8833 is failed [14:58:23] sigh, let me silence that [14:58:24] PROBLEM - thumbor@8816 service on thumbor2001 is CRITICAL: CRITICAL - 
Expecting active but unit thumbor@8816 is failed [14:58:24] PROBLEM - thumbor@8802 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8802 is failed [14:58:24] PROBLEM - thumbor@8814 service on thumbor2001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8814 is failed [14:58:34] RECOVERY - Check systemd state on thumbor2002 is OK: OK - running: The system is fully operational [14:58:34] RECOVERY - Check systemd state on thumbor2001 is OK: OK - running: The system is fully operational [14:58:46] RECOVERY - thumbor@8838 service on thumbor2001 is OK: OK - thumbor@8838 is active [14:58:46] RECOVERY - thumbor@8836 service on thumbor2001 is OK: OK - thumbor@8836 is active [14:58:46] RECOVERY - thumbor@8801 service on thumbor2001 is OK: OK - thumbor@8801 is active [14:58:46] RECOVERY - thumbor@8815 service on thumbor2001 is OK: OK - thumbor@8815 is active [14:58:54] RECOVERY - thumbor@8828 service on thumbor2001 is OK: OK - thumbor@8828 is active [14:58:54] RECOVERY - thumbor@8834 service on thumbor2001 is OK: OK - thumbor@8834 is active [14:58:55] RECOVERY - thumbor@8826 service on thumbor2004 is OK: OK - thumbor@8826 is active [14:58:55] RECOVERY - thumbor@8823 service on thumbor2001 is OK: OK - thumbor@8823 is active [14:59:04] RECOVERY - thumbor@8821 service on thumbor2001 is OK: OK - thumbor@8821 is active [14:59:04] RECOVERY - thumbor@8839 service on thumbor2001 is OK: OK - thumbor@8839 is active [14:59:04] RECOVERY - thumbor@8819 service on thumbor2001 is OK: OK - thumbor@8819 is active [14:59:04] RECOVERY - thumbor@8827 service on thumbor2001 is OK: OK - thumbor@8827 is active [14:59:04] RECOVERY - thumbor@8840 service on thumbor2001 is OK: OK - thumbor@8840 is active [14:59:09] RECOVERY - LVS HTTP IPv4 on thumbor.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 0.075 second response time [14:59:11] RECOVERY - thumbor@8835 service on thumbor2001 is OK: OK - thumbor@8835 is active [14:59:11] RECOVERY - 
thumbor@8806 service on thumbor2001 is OK: OK - thumbor@8806 is active [14:59:14] RECOVERY - thumbor@8826 service on thumbor2001 is OK: OK - thumbor@8826 is active [14:59:24] RECOVERY - thumbor@8807 service on thumbor2001 is OK: OK - thumbor@8807 is active [14:59:24] RECOVERY - thumbor@8812 service on thumbor2001 is OK: OK - thumbor@8812 is active [14:59:24] RECOVERY - thumbor@8817 service on thumbor2001 is OK: OK - thumbor@8817 is active [14:59:24] RECOVERY - thumbor@8805 service on thumbor2001 is OK: OK - thumbor@8805 is active [14:59:24] RECOVERY - thumbor@8825 service on thumbor2001 is OK: OK - thumbor@8825 is active [14:59:24] RECOVERY - thumbor@8824 service on thumbor2001 is OK: OK - thumbor@8824 is active [14:59:24] RECOVERY - thumbor@8818 service on thumbor2001 is OK: OK - thumbor@8818 is active [14:59:25] RECOVERY - thumbor@8833 service on thumbor2001 is OK: OK - thumbor@8833 is active [14:59:34] RECOVERY - thumbor@8816 service on thumbor2001 is OK: OK - thumbor@8816 is active [14:59:35] RECOVERY - thumbor@8802 service on thumbor2001 is OK: OK - thumbor@8802 is active [14:59:35] RECOVERY - thumbor@8814 service on thumbor2001 is OK: OK - thumbor@8814 is active [14:59:35] RECOVERY - thumbor@8812 service on thumbor2004 is OK: OK - thumbor@8812 is active [14:59:35] RECOVERY - thumbor@8844 service on thumbor2004 is OK: OK - thumbor@8844 is active [14:59:35] RECOVERY - thumbor@8819 service on thumbor2004 is OK: OK - thumbor@8819 is active [14:59:35] RECOVERY - thumbor@8829 service on thumbor2004 is OK: OK - thumbor@8829 is active [14:59:44] (03CR) 10BBlack: [C: 031] "PCC as expected, only eqsin nodes have changes: https://puppet-compiler.wmflabs.org/compiler02/10051/" [puppet] - 10https://gerrit.wikimedia.org/r/412921 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [14:59:44] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal [14:59:44] RECOVERY - thumbor@8847 service on thumbor2004 is OK: OK - 
thumbor@8847 is active [14:59:44] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational [14:59:44] RECOVERY - thumbor@8833 service on thumbor2004 is OK: OK - thumbor@8833 is active [14:59:44] RECOVERY - thumbor@8848 service on thumbor2004 is OK: OK - thumbor@8848 is active [14:59:44] RECOVERY - thumbor@8807 service on thumbor2004 is OK: OK - thumbor@8807 is active [14:59:44] RECOVERY - thumbor@8809 service on thumbor2004 is OK: OK - thumbor@8809 is active [14:59:45] RECOVERY - thumbor@8836 service on thumbor2004 is OK: OK - thumbor@8836 is active [14:59:45] RECOVERY - thumbor@8834 service on thumbor2004 is OK: OK - thumbor@8834 is active [14:59:46] RECOVERY - thumbor@8815 service on thumbor2004 is OK: OK - thumbor@8815 is active [14:59:46] RECOVERY - thumbor@8801 service on thumbor2004 is OK: OK - thumbor@8801 is active [14:59:47] RECOVERY - thumbor@8838 service on thumbor2004 is OK: OK - thumbor@8838 is active [15:00:03] meh, sorry about the spam -- I'll make sure to have an incident report [15:00:04] RECOVERY - thumbor@8832 service on thumbor2004 is OK: OK - thumbor@8832 is active [15:00:04] RECOVERY - thumbor@8840 service on thumbor2004 is OK: OK - thumbor@8840 is active [15:00:05] RECOVERY - thumbor@8825 service on thumbor2004 is OK: OK - thumbor@8825 is active [15:00:05] RECOVERY - thumbor@8823 service on thumbor2004 is OK: OK - thumbor@8823 is active [15:00:05] RECOVERY - thumbor@8845 service on thumbor2004 is OK: OK - thumbor@8845 is active [15:00:05] RECOVERY - thumbor@8820 service on thumbor2004 is OK: OK - thumbor@8820 is active [15:00:05] RECOVERY - thumbor@8804 service on thumbor2004 is OK: OK - thumbor@8804 is active [15:00:14] RECOVERY - thumbor@8805 service on thumbor2004 is OK: OK - thumbor@8805 is active [15:00:14] RECOVERY - thumbor@8828 service on thumbor2004 is OK: OK - thumbor@8828 is active [15:00:15] RECOVERY - thumbor@8808 service on thumbor2004 is OK: OK - thumbor@8808 is active [15:00:15] 
RECOVERY - thumbor@8821 service on thumbor2004 is OK: OK - thumbor@8821 is active [15:00:15] RECOVERY - thumbor@8817 service on thumbor2004 is OK: OK - thumbor@8817 is active [15:00:24] RECOVERY - thumbor@8802 service on thumbor2004 is OK: OK - thumbor@8802 is active [15:00:25] RECOVERY - thumbor@8830 service on thumbor2004 is OK: OK - thumbor@8830 is active [15:00:25] RECOVERY - thumbor@8822 service on thumbor2004 is OK: OK - thumbor@8822 is active [15:00:25] RECOVERY - thumbor@8803 service on thumbor2004 is OK: OK - thumbor@8803 is active [15:00:25] RECOVERY - thumbor@8827 service on thumbor2004 is OK: OK - thumbor@8827 is active [15:00:25] RECOVERY - thumbor@8835 service on thumbor2004 is OK: OK - thumbor@8835 is active [15:00:25] RECOVERY - thumbor@8818 service on thumbor2004 is OK: OK - thumbor@8818 is active [15:00:25] RECOVERY - thumbor@8841 service on thumbor2004 is OK: OK - thumbor@8841 is active [15:00:25] RECOVERY - thumbor@8824 service on thumbor2004 is OK: OK - thumbor@8824 is active [15:00:26] RECOVERY - thumbor@8810 service on thumbor2004 is OK: OK - thumbor@8810 is active [15:00:26] RECOVERY - thumbor@8806 service on thumbor2004 is OK: OK - thumbor@8806 is active [15:00:27] RECOVERY - thumbor@8839 service on thumbor2004 is OK: OK - thumbor@8839 is active [15:00:37] seems look good now, what was failing- systemd on thumbor? [15:01:06] hehe, thumbor [15:01:28] yeah, whatever, I meant that it didn't start, or was restarted? [15:01:45] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:01:47] puppet restarted it? 
[15:02:05] (03PS2) 10Gehel: wdqs: allow configuration of kafka based updates [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) [15:02:25] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal [15:02:34] (03CR) 10jerkins-bot: [V: 04-1] wdqs: allow configuration of kafka based updates [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: 10Gehel) [15:03:59] (03PS3) 10Gehel: wdqs: allow configuration of kafka based updates [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) [15:06:27] jynus: it was restarted as part of a deploy, plus I didn't roll it out in phases so hilarity ensues [15:07:24] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [15:08:22] (03PS4) 10Gehel: wdqs: allow configuration of kafka based updates [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) [15:10:29] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban), 10User-Elukey, 10Wikimedia-Incident: Phabricator: Clean up deadlocked apache processes - https://phabricator.wikimedia.org/T187790#3985850 (10mmodell) p:05Triage>03High [15:10:34] <_joe_> !log installing python-conftool on puppetmasters, cumin masters [15:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:16] <_joe_> !log upgrading conftool on the maps cluster [15:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:18] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban), 10User-Elukey, 10Wikimedia-Incident: Phabricator: Clean up deadlocked apache processes - 
https://phabricator.wikimedia.org/T187790#3985875 (10elukey) I'd personally just restart apache2 once every week in a low-traffic time of the day rath... [15:18:14] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban), 10User-Elukey, 10Wikimedia-Incident: Phabricator: Clean up deadlocked apache processes - https://phabricator.wikimedia.org/T187790#3985884 (10mmodell) Every week would leave a ton of resources tied up in the mean time - those processes are... [15:18:37] (03CR) 10BBlack: [C: 032] eqsin: configure cache storage correctly [puppet] - 10https://gerrit.wikimedia.org/r/412921 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [15:19:40] <_joe_> !log upgrading conftool on the mediawiki appservers [15:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:11] (03PS5) 10Gehel: wdqs: allow configuration of kafka based updates [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) [15:20:28] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban), 10User-Elukey, 10Wikimedia-Incident: Phabricator: Clean up deadlocked apache processes - https://phabricator.wikimedia.org/T187790#3985898 (10elukey) I would proceed with the simplest solution first, then see how it goes and refine if need... 
[15:21:25] (03PS1) 10Gilles: Add Thumbor/Mediawiki shared secret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412928 (https://phabricator.wikimedia.org/T169144) [15:21:42] (03PS1) 10Alexandros Kosiaris: Revert "Use security mirrors in cowbuilder apt config" [puppet] - 10https://gerrit.wikimedia.org/r/412929 [15:21:49] (03PS2) 10Alexandros Kosiaris: Revert "Use security mirrors in cowbuilder apt config" [puppet] - 10https://gerrit.wikimedia.org/r/412929 [15:21:53] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "Use security mirrors in cowbuilder apt config" [puppet] - 10https://gerrit.wikimedia.org/r/412929 (owner: 10Alexandros Kosiaris) [15:22:20] (03CR) 10Gilles: [C: 04-1] "Shared secret set up in beta but not in production yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412928 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [15:23:17] <_joe_> !log upgrading conftool on aqs, restbase, ores clusters [15:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:53] (03CR) 10Volans: [C: 04-2] "[on hold] Requires cumin >= 3.0.0 installed in prod that indirectly requires python3-conftool >= 1.0.0." [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/405879 (owner: 10Volans) [15:24:09] (03CR) 10Volans: [C: 04-2] "[on hold] Requires cumin >= 3.0.0 installed in prod that indirectly requires python3-conftool >= 1.0.0." 
[puppet] - 10https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773) (owner: 10Volans) [15:25:41] <_joe_> !log upgrading conftool on parsoid,wdqs [15:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:46] (03PS6) 10Gehel: wdqs: allow configuration of kafka based updates [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) [15:27:07] <_joe_> !log upgrading conftool on swift proxies, thumbor [15:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:43] (03PS1) 10Alexandros Kosiaris: Revert "Revert "Use security mirrors in cowbuilder apt config"" [puppet] - 10https://gerrit.wikimedia.org/r/412930 [15:29:48] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban), 10User-Elukey, 10Wikimedia-Incident: Phabricator: Clean up deadlocked apache processes - https://phabricator.wikimedia.org/T187790#3985939 (10mmodell) Well, that really depends on how you define "needed." The proposed script is definitely... [15:34:33] (03CR) 10Gehel: [C: 04-1] "Puppet compiler looks happy. Still needs an update to the runUpdate.sh scripts on the WDQS side." [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: 10Gehel) [15:36:01] !log eqsin: restarting all varnish backends for storage changes (not in prod traffic flow, yet!) 
[15:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:15] !log oblivian@puppetmaster1001 conftool action : edit; selector: dc=esams,name=cp3033.esams.wmnet [15:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:20] (03CR) 10Ottomata: [C: 031] "Nice :)" [puppet] - 10https://gerrit.wikimedia.org/r/411192 (https://phabricator.wikimedia.org/T184794) (owner: 10Elukey) [16:04:22] 10Operations, 10ops-codfw, 10Cloud-VPS: connect eth2 for labtestneutron2001 and 2002 - https://phabricator.wikimedia.org/T187552#3986141 (10Papaul) @chasemp Unfortunately I am afraid this can not be done on labtestneutron2001 because it has only 2 NIS's (eth0 and eth1) . Do you want me to connect only labtes... [16:05:25] 10Operations, 10ops-codfw, 10Cloud-VPS: connect eth2 for labtestneutron2001 and 2002 - https://phabricator.wikimedia.org/T187552#3986145 (10chasemp) Gah, sure yes please and that will get me started. [16:07:56] (03PS1) 10Gilles: Avoid Puppet restarting Thumbor instances agressively [puppet] - 10https://gerrit.wikimedia.org/r/412934 (https://phabricator.wikimedia.org/T169144) [16:10:20] (03PS1) 10Gilles: Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/412935 (https://phabricator.wikimedia.org/T169144) [16:10:50] (03CR) 10jerkins-bot: [V: 04-1] Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/412935 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [16:10:54] (03CR) 10Filippo Giunchedi: [C: 031] Avoid Puppet restarting Thumbor instances agressively [puppet] - 10https://gerrit.wikimedia.org/r/412934 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [16:10:56] 10Operations, 10ops-eqiad: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#3986158 (10Cmjohnson) @ayounsi row B sfp-t's are populated. 
[16:11:04] 10Operations, 10ops-codfw, 10Cloud-VPS: connect eth2 for labtestneutron2001 and 2002 - https://phabricator.wikimedia.org/T187552#3986159 (10Papaul) a:05Papaul>03chasemp labtestneutron2002 asw-b5-codfw ge-5/0/11 cable ID:6103 [16:11:06] (03CR) 10Filippo Giunchedi: [C: 032] Avoid Puppet restarting Thumbor instances agressively [puppet] - 10https://gerrit.wikimedia.org/r/412934 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [16:12:35] 10Operations, 10ops-eqiad: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#3986166 (10Cmjohnson) [16:14:09] !log gilles@tin Synchronized private/PrivateSettings.php: Add Thumbor secret to Swift configuration (duration: 00m 56s) [16:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:22] (03CR) 10Filippo Giunchedi: Give officewiki read access to Thumbor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412935 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [16:14:43] (03PS1) 10Jcrespo: mariadb: Remove s3 from dbstore2001 to save space [puppet] - 10https://gerrit.wikimedia.org/r/412938 (https://phabricator.wikimedia.org/T186596) [16:15:05] (03PS2) 10Jcrespo: mariadb: Remove s3 from dbstore2001 to save space [puppet] - 10https://gerrit.wikimedia.org/r/412938 (https://phabricator.wikimedia.org/T186596) [16:16:40] (03CR) 10Marostegui: [C: 031] mariadb: Remove s3 from dbstore2001 to save space [puppet] - 10https://gerrit.wikimedia.org/r/412938 (https://phabricator.wikimedia.org/T186596) (owner: 10Jcrespo) [16:17:40] !log drop s3 from dbstore2001 [16:17:48] (03CR) 10Jcrespo: [C: 032] mariadb: Remove s3 from dbstore2001 to save space [puppet] - 10https://gerrit.wikimedia.org/r/412938 (https://phabricator.wikimedia.org/T186596) (owner: 10Jcrespo) [16:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:06] (03Abandoned) 10BBlack: Add new misc-web-lb IPs [puppet] - 
10https://gerrit.wikimedia.org/r/370201 (https://phabricator.wikimedia.org/T170518) (owner: 10BBlack) [16:18:11] (03Abandoned) 10BBlack: Add new git-ssh IPs [puppet] - 10https://gerrit.wikimedia.org/r/370202 (https://phabricator.wikimedia.org/T170518) (owner: 10BBlack) [16:18:49] (03Abandoned) 10BBlack: Add LVS nonzero ranges in network::subnets [puppet] - 10https://gerrit.wikimedia.org/r/370210 (https://phabricator.wikimedia.org/T170518) (owner: 10BBlack) [16:19:24] (03Abandoned) 10BBlack: Reserve non zero rated IPs and ranges [dns] - 10https://gerrit.wikimedia.org/r/370094 (https://phabricator.wikimedia.org/T170518) (owner: 10Ayounsi) [16:20:08] 10Operations, 10Traffic, 10Patch-For-Review: Non zero rated LVS IPs - https://phabricator.wikimedia.org/T170518#3986229 (10BBlack) 05stalled>03declined In light of: https://blog.wikimedia.org/2018/02/16/partnerships-new-approach/ , we're not going to restructure public subnets around this, as that has lo... [16:21:16] (03PS2) 10Gilles: Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/412935 (https://phabricator.wikimedia.org/T169144) [16:22:07] (03CR) 10Gilles: Give officewiki read access to Thumbor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412935 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [16:22:08] (03CR) 10jerkins-bot: [V: 04-1] Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/412935 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [16:22:12] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Give 'sudo -u yarn' asccess to joal on analytics-hadoop-workers nodes - https://phabricator.wikimedia.org/T187723#3986235 (10RobH) p:05Triage>03Normal [16:23:29] (03PS1) 10Jcrespo: dbstore: Remove s3 from the list of locally dumped sections [puppet] - 10https://gerrit.wikimedia.org/r/412942 [16:23:55] (03PS2) 10Jcrespo: dbstore: Remove s3 from the list of locally dumped sections [puppet] - 
10https://gerrit.wikimedia.org/r/412942 [16:24:04] (03CR) 10Jcrespo: [C: 032] dbstore: Remove s3 from the list of locally dumped sections [puppet] - 10https://gerrit.wikimedia.org/r/412942 (owner: 10Jcrespo) [16:24:52] 10Operations, 10ops-codfw: rack/setup/install wdqs200[4-6] - https://phabricator.wikimedia.org/T187800#3986247 (10Papaul) p:05Triage>03Normal [16:25:39] !log installing initramfs-tools update from jessie point release [16:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:25] 10Operations, 10ops-codfw: rack/setup/install wdqs200[4-6] - https://phabricator.wikimedia.org/T187800#3986274 (10Papaul) a:05Papaul>03RobH @RobH Can you please review rack plan for the new wdqs systems and confirm if is is okay and assigned back to me. Thanks. [16:26:30] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler02/10056/thumbor2002.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/412935 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [16:27:30] (03PS3) 10Filippo Giunchedi: Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/412935 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [16:27:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Move labstore1006 and 1007 to 10G enabled racks in row A & D - https://phabricator.wikimedia.org/T186756#3986279 (10madhuvishy) 05Open>03Resolved The servers are moved and up and running! Thanks for your work @Cmjohnson. 
[16:28:06] (03CR) 10jerkins-bot: [V: 04-1] Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/412935 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [16:29:03] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/412935 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [16:29:23] I've just noticed now that it's called/rewritten to jeRkins-bot here, haha [16:29:39] gilles: it's only jerkins when tests fail ;) [16:29:47] nice [16:31:20] TIL a jerkin is actually a thing [16:32:02] the gift that keeps on giving [16:33:42] !log roll-restart thumbor in codfw/eqiad to apply https://gerrit.wikimedia.org/r/412935 [16:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:49] (03PS1) 10Jcrespo: prometheus-mysqld-exporter: Clean up of s3 from dbstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/412945 [16:37:16] (03PS2) 10Jcrespo: prometheus-mysqld-exporter: Clean up of s3 from dbstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/412945 [16:38:40] (03CR) 10Jcrespo: [C: 032] prometheus-mysqld-exporter: Clean up of s3 from dbstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/412945 (owner: 10Jcrespo) [16:41:39] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3986351 (10Cmjohnson) [16:43:08] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3925325 (10Cmjohnson) @Marostegui This server only has 2 4TB disks, with no raid card. This will need a software raid. Let me know if you want to reconsid... [16:44:11] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#3986371 (10Cmjohnson) @robh does this still need idrac setup? 
[16:45:10] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3986376 (10jcrespo) ^that was my main reason to object to not call it db*, all db* hosts have a hardware raid and are, to some extent, interchangable. [16:46:35] (03PS1) 10Zoranzoki21: Added throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) [16:47:17] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild raids on labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#3986385 (10Cmjohnson) Confirmed 10 each assigning to @Andrew to resolve if satisfied [16:47:39] (03PS2) 10Zoranzoki21: Added throttle rule for Wikipedia Women in Red editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412947 (https://phabricator.wikimedia.org/T187803) [16:47:45] 10Operations: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3986390 (10elukey) [16:48:07] (03PS1) 10Filippo Giunchedi: thumbor: stop checking individual instance status [puppet] - 10https://gerrit.wikimedia.org/r/412948 [16:49:07] (03PS5) 10Jcrespo: Add Proxysql creation debian package script [software] - 10https://gerrit.wikimedia.org/r/404153 [16:49:09] (03PS1) 10Jcrespo: Add mariadb package changes for 10.1.31 [software] - 10https://gerrit.wikimedia.org/r/412949 [16:49:12] (03PS1) 10Jcrespo: Remove dbstore2001 from the s3 host list [software] - 10https://gerrit.wikimedia.org/r/412950 [16:49:45] 10Operations: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3986427 (10elukey) [16:49:47] (03CR) 10Jcrespo: [V: 032 C: 032] Add mariadb package changes for 10.1.31 [software] - 10https://gerrit.wikimedia.org/r/412949 (owner: 10Jcrespo) [16:50:05] (03CR) 10Jcrespo: [V: 032 C: 032] Remove dbstore2001 from the 
s3 host list [software] - 10https://gerrit.wikimedia.org/r/412950 (owner: 10Jcrespo) [16:50:16] (03PS2) 10Jcrespo: Remove dbstore2001 from the s3 host list [software] - 10https://gerrit.wikimedia.org/r/412950 [16:50:20] (03CR) 10Jcrespo: [V: 032 C: 032] Remove dbstore2001 from the s3 host list [software] - 10https://gerrit.wikimedia.org/r/412950 (owner: 10Jcrespo) [16:50:40] (03PS2) 10Jcrespo: Add mariadb package changes for 10.1.31 [software] - 10https://gerrit.wikimedia.org/r/412949 [16:50:43] 10Operations, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3986390 (10elukey) [16:50:47] (03CR) 10Jcrespo: [V: 032 C: 032] Add mariadb package changes for 10.1.31 [software] - 10https://gerrit.wikimedia.org/r/412949 (owner: 10Jcrespo) [16:50:54] (03PS19) 10Zoranzoki21: Change namespaces on urwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) [16:54:06] PROBLEM - puppet last run on wtp2018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[initramfs-tools] [16:55:08] (03PS20) 10Zoranzoki21: Change namespaces on urwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) [16:56:13] 10Operations, 10User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3986459 (10elukey) [16:57:10] incident report for thumbor is here btw https://wikitech.wikimedia.org/wiki/Incident_documentation/20180220-thumbor [16:57:18] <_joe_> godog: thanks [16:57:27] (03PS4) 10Zoranzoki21: Disable Flow extension on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) [16:57:41] (03CR) 10Zoranzoki21: "Please remove -2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: 10Zoranzoki21) [16:58:11] (03PS2) 10Filippo Giunchedi: thumbor: stop checking individual instance status [puppet] - 10https://gerrit.wikimedia.org/r/412948 [16:58:26] looks good godog [16:58:50] (03PS1) 10Gilles: Serve officewiki thumbnails with Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412952 (https://phabricator.wikimedia.org/T169144) [16:58:51] (briefly followed the outage but the incident report is super clear) [16:59:07] RECOVERY - puppet last run on wtp2018 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:59:31] sweet, thanks _joe_ elukey [17:00:04] godog, moritzm, and _joe_: How many deployers does it take to do Puppet SWAT(Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180220T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. 
[17:00:11] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3986472 (10Cmjohnson) It's not too late to rename it to tendril1001...easy dns change right now [17:01:26] 10Operations, 10netops: cr1-eqsin faulty interfaces - https://phabricator.wikimedia.org/T187807#3986478 (10ayounsi) [17:01:37] godog: the only thing to get 10+ would be to also have a brief note about what was the user impact during the outage (just to allow people not familiar with thumbor to understand) [17:01:43] no patches -> https://i.imgur.com/W3GhAuf.mp4 [17:01:58] aahah [17:02:27] elukey: yeah that's a good idea, I'll mention that too [17:03:21] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: stop checking individual instance status [puppet] - 10https://gerrit.wikimedia.org/r/412948 (owner: 10Filippo Giunchedi) [17:05:45] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3986520 (10Marostegui) >>! In T185788#3986376, @jcrespo wrote: > ^that was my main reason to object to not call it db*, all db* hosts have a hardware raid a... [17:05:59] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3986522 (10jcrespo) no, tendril is definitely not ok. [17:06:07] PROBLEM - puppet last run on es1013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [17:07:06] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3986529 (10jcrespo) I did not reopen the discussion this, Chris did. 
[17:08:07] (03PS1) 10Anomie: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412955 (https://phabricator.wikimedia.org/T166733) [17:08:31] (03CR) 10Anomie: [C: 032] "Config change, previously discussed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412955 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [17:09:06] (03PS2) 10Ema: etcd: Introduce reconnectTimeout [debs/pybal] - 10https://gerrit.wikimedia.org/r/411264 (https://phabricator.wikimedia.org/T169765) [17:09:55] (03Merged) 10jenkins-bot: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412955 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [17:09:57] PROBLEM - puppet last run on mc1026 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [17:10:07] (03CR) 10jenkins-bot: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412955 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [17:10:47] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4031 is CRITICAL: connect to address 10.128.0.131 and port 3128: Connection refused [17:11:08] !log anomie@tin Synchronized wmf-config/InitialiseSettings.php: Setting wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on group 1 (duration: 00m 56s) [17:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:41] (03CR) 10Ema: "> Patch Set 1: Code-Review+1" [debs/pybal] - 10https://gerrit.wikimedia.org/r/411264 (https://phabricator.wikimedia.org/T169765) (owner: 10Ema) [17:11:47] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4031 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.158 second response time [17:13:28] elukey: btw the burrow change broke puppet on sites 
not codfw/eqiad [17:15:52] godog: lovely, because of the absence of default in the ? I guess [17:16:35] where is puppet broken? [17:17:02] bast[345]* [17:17:09] yes I think because no default for ? [17:17:27] ah lovely I wasn't aware that puppet would have run on those too [17:17:32] * elukey ignorant [17:17:55] godog: since the targets for main-codfw seems not there, shall I simply remove it for the moment? [17:17:59] <_joe_> elukey: yeah, we keep bastions unpuppetized for security reasons [17:18:01] <_joe_> :P [17:18:27] _joe_ you are so kind as always :D [17:18:47] <_joe_> elukey: tag always there [17:18:48] elukey: what's "it" ? [17:19:28] godog: the main-codfw burrow config [17:19:45] CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [17:19:48] elukey: sure that'll fix it too [17:22:36] 10Operations, 10Traffic: varnish: discard cold vcl - https://phabricator.wikimedia.org/T187778#3985419 (10BBlack) This could potentially be a large contributor to memory pressure issues we run into elsewhere, as well (and the inconsistencies around these, which may have to do with average reloads rates vs rest... [17:23:27] (03PS1) 10Elukey: role::prometheus::ops: remove broken Kafka burrow config [puppet] - 10https://gerrit.wikimedia.org/r/412959 (https://phabricator.wikimedia.org/T180442) [17:23:44] godog: --^ [17:24:08] (03CR) 10Filippo Giunchedi: [C: 031] role::prometheus::ops: remove broken Kafka burrow config [puppet] - 10https://gerrit.wikimedia.org/r/412959 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [17:24:13] elukey: looks good! 
thanks [17:24:29] (03CR) 10Elukey: [C: 032] role::prometheus::ops: remove broken Kafka burrow config [puppet] - 10https://gerrit.wikimedia.org/r/412959 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [17:27:43] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3986656 (10Marostegui) @Cmjohnson, please proceed with db1115 as @jcrespo and myself agreed on that hostname yesterday in our weekly meeting. [17:27:49] (03CR) 10Rush: "So atm I think this will create this check for every instance, even though it's always checking for the same thing which is toolforge spec" [puppet] - 10https://gerrit.wikimedia.org/r/411315 (owner: 10Chico Venancio) [17:29:02] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:29:44] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073#3986665 (10Cmjohnson) [17:30:04] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#3986668 (10Cmjohnson) [17:30:24] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#3986669 (10Cmjohnson) [17:30:51] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frbast1001 - https://phabricator.wikimedia.org/T187363#3986674 (10Cmjohnson) [17:31:02] RECOVERY - puppet last run on es1013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:32:03] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:35:02] RECOVERY - puppet last run on mc1026 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:36:52] RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:37:23] 
10Operations: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703#3986706 (10MoritzMuehlenhoff) 05Open>03Resolved This is complete [17:40:46] !log andrew@tin Started deploy [striker/deploy@3684a73]: rolling stretch-ready striker out to labweb hosts [17:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:41] !log andrew@tin Finished deploy [striker/deploy@3684a73]: rolling stretch-ready striker out to labweb hosts (duration: 00m 55s) [17:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:28] 10Operations, 10ops-codfw, 10netops: codfw: mgmt switch replacement in D4 - https://phabricator.wikimedia.org/T187816#3986725 (10Papaul) p:05Triage>03Normal [17:42:47] (03CR) 10Arturo Borrero Gonzalez: "We the team agreed to drop this patch by now." [puppet] - 10https://gerrit.wikimedia.org/r/412699 (owner: 10Arturo Borrero Gonzalez) [17:43:00] (03Abandoned) 10Arturo Borrero Gonzalez: toollabs: use require_package() [puppet] - 10https://gerrit.wikimedia.org/r/412699 (owner: 10Arturo Borrero Gonzalez) [17:43:58] 10Operations, 10ops-codfw, 10netops: codfw: mgmt switch replacement in D4 - https://phabricator.wikimedia.org/T187816#3986741 (10Papaul) [17:44:12] (03PS1) 10Andrew Bogott: striker: on Stretch, include libmariadbclient18 [puppet] - 10https://gerrit.wikimedia.org/r/412962 [17:44:43] (03CR) 10Andrew Bogott: [C: 032] striker: on Stretch, include libmariadbclient18 [puppet] - 10https://gerrit.wikimedia.org/r/412962 (owner: 10Andrew Bogott) [17:47:14] !log ppchelko@tin Started deploy [restbase/deploy@dca0290]: Switch summary implementation to MCS T179875 [17:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:26] T179875: Update RESTBase to get summary content from MCS Summary 1.3 endpoint when development is complete - https://phabricator.wikimedia.org/T179875 [17:50:32] !log mwscript 
extensions/WikimediaMaintenance/filebackend/setZoneAccess.php --wiki=officewiki --backend=local-multiwrite --private [17:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:06] anomie: how is BOTH working? everything good? [17:52:14] !log installing cups updates from jessie point release [17:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:34] jynus: Everything looks good so far, I haven't seen any uptick in errors or slow queries. [17:53:04] with both, how do reads work, do they try the new system first? [17:53:18] or do they keep reads with the old field? [17:53:27] Yes, they try the new first and then fall back to the old. [17:53:31] cool [17:53:43] I will keep an eye on sizes [17:55:02] (03PS1) 10Andrew Bogott: striker: update db grants for new labweb services [puppet] - 10https://gerrit.wikimedia.org/r/412964 [17:56:04] (03PS1) 10Ayounsi: Icinga: add asw2-b-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/412965 [17:56:43] RECOVERY - puppet last run on bast4002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:56:54] (03CR) 10Ayounsi: [C: 032] Icinga: add asw2-b-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/412965 (owner: 10Ayounsi) [17:57:33] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission xenon, cerium, praseodymium - https://phabricator.wikimedia.org/T187446#3986839 (10Eevans) Confirmed in the Services team meeting today; These machines can be decommissioned at the earliest convenience! [17:57:49] 10Operations, 10ops-codfw, 10DC-Ops: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#3986842 (10Eevans) Confirmed in the Services team meeting today; These machines can be decommissioned at the earliest convenience! [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180220T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:18] Nothing for ORES today. [18:01:41] 10Operations, 10ops-eqiad: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#3986860 (10ayounsi) [18:02:51] (03PS3) 10Papaul: DNS: Add production DNS entry for db2093 [dns] - 10https://gerrit.wikimedia.org/r/407454 [18:03:14] !log ppchelko@tin Finished deploy [restbase/deploy@dca0290]: Switch summary implementation to MCS T179875 (duration: 16m 01s) [18:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:27] T179875: Update RESTBase to get summary content from MCS Summary 1.3 endpoint when development is complete - https://phabricator.wikimedia.org/T179875 [18:08:16] ar.lolra will be deploying parsoid patches. [18:09:44] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3986867 (10Cmjohnson) @marostegui Okay, since I cannot do standard DB raid...any suggestions? [18:10:53] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3986868 (10jcrespo) Do nothing, we (the recipe) will install the RAID1 in software. 
[18:11:47] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 39.06, 36.60, 32.61 [18:12:16] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3986878 (10jcrespo) ``` $ git grep db1115 modules/install_server/files/autoinstall/netboot.cfg: db1115|db2093) echo partman/raid1-gpt.cfg ;; \ ``` [18:12:20] (03PS2) 10Andrew Bogott: m5: update db grants for new labweb services [puppet] - 10https://gerrit.wikimedia.org/r/412964 [18:12:23] (03PS1) 10Andrew Bogott: m5: add ferm rules for new labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/412970 (https://phabricator.wikimedia.org/T168470) [18:12:48] (03CR) 10jerkins-bot: [V: 04-1] m5: update db grants for new labweb services [puppet] - 10https://gerrit.wikimedia.org/r/412964 (owner: 10Andrew Bogott) [18:13:53] (03CR) 10Jcrespo: [C: 031] m5: add ferm rules for new labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/412970 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [18:14:31] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3986883 (10Cmjohnson) [18:14:33] (03PS3) 10Andrew Bogott: m5: update db grants for new labweb services [puppet] - 10https://gerrit.wikimedia.org/r/412964 (https://phabricator.wikimedia.org/T168470) [18:14:35] (03PS2) 10Andrew Bogott: m5: add ferm rules for new labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/412970 (https://phabricator.wikimedia.org/T168470) [18:15:42] (03CR) 10Muehlenhoff: [C: 031] m5: add ferm rules for new labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/412970 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [18:15:49] (03CR) 10Jcrespo: [C: 031] "Looks good but let's coordinate to deploy." 
[puppet] - 10https://gerrit.wikimedia.org/r/412964 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [18:16:41] (03PS1) 10Muehlenhoff: Add library hint for sqlite3 [puppet] - 10https://gerrit.wikimedia.org/r/412972 [18:16:59] (03PS2) 10Muehlenhoff: Add library hint for sqlite3 [puppet] - 10https://gerrit.wikimedia.org/r/412972 [18:17:22] (03PS1) 10Cmjohnson: Adding dhcpd entry db1115 [puppet] - 10https://gerrit.wikimedia.org/r/412973 (https://phabricator.wikimedia.org/T185788) [18:17:57] (03CR) 10Muehlenhoff: [C: 032] Add library hint for sqlite3 [puppet] - 10https://gerrit.wikimedia.org/r/412972 (owner: 10Muehlenhoff) [18:18:12] (03CR) 10Cmjohnson: [C: 032] Adding dhcpd entry db1115 [puppet] - 10https://gerrit.wikimedia.org/r/412973 (https://phabricator.wikimedia.org/T185788) (owner: 10Cmjohnson) [18:18:21] (03PS2) 10Cmjohnson: Adding dhcpd entry db1115 [puppet] - 10https://gerrit.wikimedia.org/r/412973 (https://phabricator.wikimedia.org/T185788) [18:18:23] (03CR) 10Cmjohnson: [V: 032 C: 032] Adding dhcpd entry db1115 [puppet] - 10https://gerrit.wikimedia.org/r/412973 (https://phabricator.wikimedia.org/T185788) (owner: 10Cmjohnson) [18:21:20] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3986893 (10jcrespo) [18:23:27] !log arlolra@tin Started deploy [parsoid/deploy@5fbabfc]: Updating Parsoid to e5e8113 [18:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:45] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#3986906 (10RobH) Please ensure all decom requests are tagged with #hw-requests. [18:28:49] (03CR) 10Ottomata: "Cool!" 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/411195 (https://phabricator.wikimedia.org/T184794) (owner: 10Elukey) [18:29:45] (03CR) 10Ottomata: [C: 031] Simplify zookeeper's default template to be systemd friendly [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/412744 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [18:34:04] !log arlolra@tin Finished deploy [parsoid/deploy@5fbabfc]: Updating Parsoid to e5e8113 (duration: 10m 37s) [18:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:24] !log ppchelko@tin Started deploy [restbase/deploy@e9bef90]: Do not return the response for summaery right away, store first T179875 [18:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:39] T179875: Update RESTBase to get summary content from MCS Summary 1.3 endpoint when development is complete - https://phabricator.wikimedia.org/T179875 [18:47:14] 10Operations, 10netops: Rack/cable/configure mr1-eqiad - https://phabricator.wikimedia.org/T187820#3986943 (10ayounsi) p:05Triage>03Normal [18:57:25] !log ppchelko@tin Finished deploy [restbase/deploy@e9bef90]: Do not return the response for summaery right away, store first T179875 (duration: 14m 02s) [18:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:39] T179875: Update RESTBase to get summary content from MCS Summary 1.3 endpoint when development is complete - https://phabricator.wikimedia.org/T179875 [18:57:57] !log ppchelko@tin Started deploy [restbase/deploy@e9bef90]: Do not return the response for summaery right away, store first T179875 take 2 [18:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:49] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Services, 10Readers-Web-Kanbanana-Board: Choose a deploy server for the chromium-render service - https://phabricator.wikimedia.org/T187821#3986975 (10Niedzielski) [18:59:14] 10Operations, 10Proton, 
10Readers-Web-Backlog, 10Services, 10Readers-Web-Kanbanana-Board: Choose a deploy server for the chromium-render service - https://phabricator.wikimedia.org/T187821#3986990 (10Niedzielski) [18:59:19] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#3986989 (10Niedzielski) [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180220T1900) [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:07] (03CR) 10Niedzielski: [C: 04-1] "Pending resolution of https://phabricator.wikimedia.org/T187821." [puppet] - 10https://gerrit.wikimedia.org/r/409996 (https://phabricator.wikimedia.org/T178166) (owner: 10Niedzielski) [19:00:43] !log ppchelko@tin Finished deploy [restbase/deploy@e9bef90]: Do not return the response for summaery right away, store first T179875 take 2 (duration: 02m 47s) [19:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:28] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3987008 (10jcrespo) [19:05:25] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3854215 (10jcrespo) [19:05:30] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3987011 (10jcrespo) [19:05:50] (03PS1) 10Jcrespo: mariadb: Allow db2044 reimage on stretch [puppet] - 10https://gerrit.wikimedia.org/r/412978 (https://phabricator.wikimedia.org/T183470) [19:06:23] (03PS1) 10Andrew Bogott: horizon: include 'gettext' package [puppet] - 10https://gerrit.wikimedia.org/r/412979 [19:07:21] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission 
restbase-test200[123] - https://phabricator.wikimedia.org/T187447#3987030 (10RobH) [19:07:41] (03PS2) 10Jcrespo: mariadb: Allow db2044 reimage on stretch [puppet] - 10https://gerrit.wikimedia.org/r/412978 (https://phabricator.wikimedia.org/T183470) [19:08:11] (03PS3) 10Jcrespo: mariadb: Allow db2044 reimage on stretch [puppet] - 10https://gerrit.wikimedia.org/r/412978 (https://phabricator.wikimedia.org/T183470) [19:08:41] (03CR) 10Jcrespo: [C: 032] mariadb: Allow db2044 reimage on stretch [puppet] - 10https://gerrit.wikimedia.org/r/412978 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [19:12:29] (03PS1) 10Gilles: Add all private wikis to swift::proxy::private_container_list [puppet] - 10https://gerrit.wikimedia.org/r/412980 (https://phabricator.wikimedia.org/T169144) [19:14:47] (03PS1) 10EBernhardson: Increase pool counter workers for cirrus namespace lookup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412982 [19:15:03] (03PS1) 10Jdrewniak: Removing Mobile beta feedback link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712) [19:17:20] (03PS2) 10Andrew Bogott: horizon: include 'gettext' package [puppet] - 10https://gerrit.wikimedia.org/r/412979 [19:18:37] (03CR) 10Niedzielski: [C: 031] Removing Mobile beta feedback link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712) (owner: 10Jdrewniak) [19:18:39] (03CR) 10Andrew Bogott: [C: 032] horizon: include 'gettext' package [puppet] - 10https://gerrit.wikimedia.org/r/412979 (owner: 10Andrew Bogott) [19:26:08] !log mobrovac@tin Started restart [changeprop/deploy@5fdc03a]: (no justification provided) [19:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:59] (03CR) 10Chad: [C: 032] mw.org: Symlink keys.html to index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411367 (owner: 10Chad) [19:39:31] (03Merged) 10jenkins-bot: mw.org: Symlink 
keys.html to index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411367 (owner: 10Chad) [19:39:41] (03CR) 10jenkins-bot: mw.org: Symlink keys.html to index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411367 (owner: 10Chad) [19:45:09] !log demon@tin Synchronized docroot/mediawiki/keys/: symlink magic (duration: 00m 56s) [19:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:52] (03PS1) 10Andrew Bogott: horizon deploy: special weird venv sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/412987 (https://phabricator.wikimedia.org/T187811) [19:49:25] (03CR) 10jerkins-bot: [V: 04-1] horizon deploy: special weird venv sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/412987 (https://phabricator.wikimedia.org/T187811) (owner: 10Andrew Bogott) [19:52:26] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Setup cron for foreachwikiindblist all-labs.dblist extensions/AbuseFilter/maintenance/purgeOldLogIPData.php on Beta - https://phabricator.wikimedia.org/T187658#3987192 (10Reedy) [19:53:05] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Services, 10Readers-Web-Kanbanana-Board: Choose a deploy server for the chromium-render service - https://phabricator.wikimedia.org/T187821#3987196 (10Niedzielski) @ovasileva, @phuedx this is in the sprint for tracking because T186748 is blocked on it {ico... 
[19:53:14] (03PS2) 10Andrew Bogott: horizon deploy: special weird venv sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/412987 (https://phabricator.wikimedia.org/T187811) [19:54:04] (03CR) 10Andrew Bogott: [C: 032] horizon deploy: special weird venv sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/412987 (https://phabricator.wikimedia.org/T187811) (owner: 10Andrew Bogott) [19:55:34] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild raids on labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#3987207 (10chasemp) a:05chasemp>03Andrew [19:57:27] !log Cutting new branch wmf/1.31.0-wmf.22 - Deployment blockers: T183961 [19:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:39] T183961: 1.31.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T183961 [19:58:20] (03PS1) 10Rush: openstack: use labtestmetal as a virt for now [puppet] - 10https://gerrit.wikimedia.org/r/412991 (https://phabricator.wikimedia.org/T168891) [20:00:04] twentyafterfour: How many deployers does it take to do MediaWiki train deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180220T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. 
[20:01:12] (CR) Rush: [C: 2] openstack: use labtestmetal as a virt for now [puppet] - https://gerrit.wikimedia.org/r/412991 (https://phabricator.wikimedia.org/T168891) (owner: Rush) [20:01:17] (PS2) Rush: openstack: use labtestmetal as a virt for now [puppet] - https://gerrit.wikimedia.org/r/412991 (https://phabricator.wikimedia.org/T168891) [20:01:43] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3987221 (jcrespo) [20:01:46] Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3987220 (jcrespo) [20:02:01] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3854215 (jcrespo) [20:04:06] (CR) VolkerE: "Shouldn't this have a depends-on referral?" [mediawiki-config] - https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle) [20:05:30] Operations, Phabricator, Release-Engineering-Team (Kanban), User-Elukey, Wikimedia-Incident: Phabricator: Clean up deadlocked apache processes - https://phabricator.wikimedia.org/T187790#3987225 (Dzahn) re: "simplest solution first" i would suggest we just do systemctl apache2 restart also...
[20:07:13] !log andrew@tin Started deploy [horizon/deploy@6a40f84]: a couple of bug fixes [20:07:19] (PS1) Jcrespo: mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) [20:07:25] !log andrew@tin Started deploy [horizon/deploy@b02c819]: a couple of bug fixes [20:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:49] (CR) jerkins-bot: [V: -1] mariadb: Move db2044 from codfw-core-s4 to codfw-misc-s2 [puppet] - https://gerrit.wikimedia.org/r/412994 (https://phabricator.wikimedia.org/T183470) (owner: Jcrespo) [20:08:08] (PS2) Krinkle: [WIP] errorpages: Remove unused hhvm-fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) [20:10:21] !log andrew@tin Finished deploy [horizon/deploy@b02c819]: a couple of bug fixes (duration: 02m 55s) [20:10:27] (PS1) Jcrespo: dbproxy: Setup db2044 as the main m2 host on codfw and monitor it [puppet] - https://gerrit.wikimedia.org/r/412995 (https://phabricator.wikimedia.org/T183470) [20:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:48] (PS1) Pmiazga: Disable Page Previews EventLogging instrumentation [mediawiki-config] - https://gerrit.wikimedia.org/r/412996 (https://phabricator.wikimedia.org/T185973) [20:14:39] !log andrew@tin Started deploy [horizon/deploy@b02c819]: trying to get a clean deploy [20:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:03] (PS1) Rush: openstack: labtestmetal partition setting in hiera [puppet] - https://gerrit.wikimedia.org/r/412997 [20:16:08] (PS2) Rush: openstack: labtestmetal partition setting in hiera [puppet] - https://gerrit.wikimedia.org/r/412997 [20:16:33] !log andrew@tin Finished deploy
[horizon/deploy@b02c819]: trying to get a clean deploy (duration: 01m 54s) [20:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:04] (CR) Rush: [C: 2] openstack: labtestmetal partition setting in hiera [puppet] - https://gerrit.wikimedia.org/r/412997 (owner: Rush) [20:17:52] !log labtestmetal mkfs -t xfs -i size=512 /dev/mapper/labtestmetal2001--vg-data [20:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:43] !log labtestmetal2001:~# aptitude install linux-image-4.4.0-109-generic && aptitude install linux-image-extra-4.4.0-109-generic [20:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:24] (CR) Niedzielski: [C: 1] Disable Page Previews EventLogging instrumentation [mediawiki-config] - https://gerrit.wikimedia.org/r/412996 (https://phabricator.wikimedia.org/T185973) (owner: Pmiazga) [20:23:56] RECOVERY - configured eth on labtestmetal2001 is OK: OK - interfaces up [20:29:06] (CR) Smalyshev: wdqs: allow configuration of kafka based updates (4 comments) [puppet] - https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: Gehel) [20:31:36] !log twentyafterfour@tin Started scap: Sync 1.31.0-wmf.22 and promote test wikis - refs T183961 [20:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:50] T183961: 1.31.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T183961 [20:33:18] Operations, Cloud-Services, cloud-services-team (Kanban): Rebuild raids on labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#3987278 (Andrew) a:Andrew>Cmjohnson on labvirt1019: ``` => ctrl slot=0 pd all show status physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 1.6 TB): OK...
[20:35:11] Operations, Cloud-Services, cloud-services-team (Kanban): Rebuild raids on labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#3987283 (Cmjohnson) I apologize Inreacll an IRC conversation about this. I will need to reboot them into raid bios and re-configure the raid. Any issues with... [20:42:57] (PS2) Ladsgroup: Enable x-kill feature everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/412664 (https://phabricator.wikimedia.org/T186714) [20:46:00] (CR) Pmiazga: [C: 1] Removing Mobile beta feedback link [mediawiki-config] - https://gerrit.wikimedia.org/r/412983 (https://phabricator.wikimedia.org/T187712) (owner: Jdrewniak) [20:47:12] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on naos is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging [20:50:19] Operations, ops-codfw, Cloud-VPS: connect eth2 for labtestneutron2001 and 2002 - https://phabricator.wikimedia.org/T187552#3987321 (chasemp) Open>Resolved >>! In T187552#3986159, @Papaul wrote: > labtestneutron2002 > asw-b5-codfw ge-5/0/11 cable ID:6103 Thank you @papaul. Fixing the descri... [20:50:21] Operations, Ops-Access-Requests, Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3987326 (Jgreen) [20:55:02] (CR) Ottomata: "Joal, not adding you to review, but just to nudge a 2.2.1 update for jobs :)" [debs/spark2] (debian) - https://gerrit.wikimedia.org/r/405894 (https://phabricator.wikimedia.org/T185581) (owner: Ottomata) [20:57:12] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on naos is OK: Files ownership is ok. [21:01:10] Operations, Cloud-Services, cloud-services-team (Kanban): Rebuild raids on labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#3987363 (Andrew) Nope, you can reboot/rebuild them at any time.
[21:03:53] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 41.58, 36.05, 32.28 [21:04:15] *sigh* - https://gerrit.wikimedia.org/r/#/c/412814/ == Error 500 [21:04:22] while trying to rebase [21:04:22] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 42.71, 36.05, 31.88 [21:04:59] no_justification ^^ [21:05:10] i get "Could not perform action: Internal server error" when using the cherry pick promt [21:05:15] prompt [21:05:21] as i doin't have access to rebase [21:09:21] Derp. [21:10:25] Hauskatze: Easy workaround: cherry-pick locally onto master, push rebased version [21:10:51] Caused by: org.eclipse.jgit.errors.MissingObjectException: Missing blob 1e8a4fb032973f59f2548320b08935187ccf1c23 [21:10:58] There's a fuller stacktrace, but there ya go [21:11:23] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 47.79, 35.31, 32.17 [21:11:26] heh. [21:11:27] no_justification: I'll try that. Shall I clone mediawiki/extensions on my local machine? [21:11:41] That may take you a while to clone that repo [21:11:47] yeh... [21:11:49] + git submodule init and git submodule update [21:12:01] https://phabricator.wikimedia.org/P6722 [21:12:14] You....dont need to update the submodules? [21:12:23] Just clone the parent repo (with --depth=1 if you want) [21:12:33] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 62.46, 30.64, 22.08 [21:12:34] I wonder why it's missing an object. [21:12:36] Then change the .gitmodules and drop the directory [21:12:37] Bam [21:12:37] You win [21:12:53] or I can abandon that fu... 
change and publish another one [21:13:32] Puppet, cloud-services-team, Patch-For-Review: Install hp health tools on labvirts where appropriate - https://phabricator.wikimedia.org/T187355#3987392 (Bstorm) Open>Resolved a:Bstorm [21:13:33] RECOVERY - High CPU load on API appserver on mw1277 is OK: OK - load average: 32.25, 27.71, 21.61 [21:13:40] but for the knowledge I'll do the --depth=1 thingy [21:13:42] hmm 1e8a4fb032973f59f2548320b08935187ccf1c23 leads me to https://gerrit.wikimedia.org/r/#/c/412259/ [21:15:04] I'm generally a fan of having that repo cloned anyway :) [21:15:09] So I have all extensions available on demand :) [21:16:13] git review -d 412814 taking its time... [21:16:31] (PS3) Ottomata: Parameterize webrequests from varnishkafka to jumbo Kafka cluster via TLS [puppet] - https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461) [21:17:13] (CR) jerkins-bot: [V: -1] Parameterize webrequests from varnishkafka to jumbo Kafka cluster via TLS [puppet] - https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461) (owner: Ottomata) [21:18:36] !log twentyafterfour@tin Finished scap: Sync 1.31.0-wmf.22 and promote test wikis - refs T183961 (duration: 46m 59s) [21:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:50] T183961: 1.31.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T183961 [21:19:15] (PS4) Ottomata: Parameterize webrequests from varnishkafka to jumbo Kafka cluster via TLS [puppet] - https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461) [21:19:42] (CR) jerkins-bot: [V: -1] Parameterize webrequests from varnishkafka to jumbo Kafka cluster via TLS [puppet] - https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461) (owner: Ottomata) [21:19:46] done [21:19:52] Hauskatze: Cherry pick!
[21:19:59] I just made a new patch but used the same change-id [21:20:56] (PS5) Ottomata: Parameterize webrequests from varnishkafka to jumbo Kafka cluster via TLS [puppet] - https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461) [21:21:11] no_justification: I think https://gerrit.wikimedia.org/r/#/c/412814/ works fine the way I did? [21:22:24] Merged :) [21:23:53] chachi [21:24:16] so, new workaround, upload a patch keeping change-id intact :) [21:27:40] (CR) Ottomata: "No config changes: https://puppet-compiler.wmflabs.org/compiler02/10057/ looks good!" [puppet] - https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461) (owner: Ottomata) [21:28:05] https://github.com/wikimedia/mediawiki-extensions-SemanticPageMaker && https://github.com/wikimedia/mediawiki-extensions-WikiObjectModel -- Burninate plx ktnx [21:30:25] (PS6) Ottomata: Parameterize webrequests from varnishkafka to jumbo Kafka cluster via TLS [puppet] - https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461) [21:31:09] can somebody please run namespaceDupes for T187660 ?
[21:31:10] T187660: en.wikiversity Draft Namespace Inaccessible Pages - https://phabricator.wikimedia.org/T187660 [21:31:21] (CR) Ottomata: [C: 2] Parameterize webrequests from varnishkafka to jumbo Kafka cluster via TLS [puppet] - https://gerrit.wikimedia.org/r/404737 (https://phabricator.wikimedia.org/T175461) (owner: Ottomata) [21:39:47] !log ran `namespaceDupes.php --wiki=enwikiversity` for T187660 [21:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:00] T187660: en.wikiversity Draft Namespace Inaccessible Pages - https://phabricator.wikimedia.org/T187660 [21:40:06] pagelinks from=54466 ns=0 dbk=WV:NPOV -> Wikiversity:NPOV DRY RUN [21:40:07] Hauskatze: That one seems to have not gotten caught ^^^ [21:40:27] (PS1) 20after4: group0 wikis to 1.31.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413051 [21:40:29] (CR) 20after4: [C: 2] group0 wikis to 1.31.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413051 (owner: 20after4) [21:40:44] Otherwise, done. [21:41:03] with --fix no_justification ? [21:41:09] obvs. [21:41:11] <3 [21:41:22] It wouldn't fix stuff on a wiki last week for me :( [21:41:41] ok so only page_id 54466 is the remaining conflict now, right? [21:41:44] (Merged) jenkins-bot: group0 wikis to 1.31.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413051 (owner: 20after4) [21:42:33] Hauskatze: Yep. All others fixed automagically [21:42:35] But that one wouldn't [21:42:47] I can delete that one via API [21:42:54] pagelinks from=54466 ns=0 dbk=WV:NPOV -> Wikiversity:NPOV DRY RUN [21:42:54] 1 links to fix, 1 were resolvable. [21:43:01] or... I shouldn't...
I'm not a sysop there [21:43:11] namespaceDupes is a dupe sometimes :p [21:43:48] https://en.wikiversity.org/wiki/Special:DoubleRedirects <-- the second one is weird [21:44:13] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group0 wikis to 1.31.0-wmf.22 [21:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:19] Hauskatze: namespace changes are *weird* :\ [21:45:30] We could/should just manually resolve that [21:45:38] (ie: me muck in the database) [21:45:57] there was an old script... I don't remember the name [21:46:38] https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/cleanupTitles.php [21:46:45] not sure if that would work [21:47:48] I'm seeing a lot of this: "Fatal error: request has exceeded memory limit in /srv/mediawiki/php-1.31.0-wmf.21/includes/parser/StripState.php on line 137" [21:47:59] not in the new branch, so far, but in the current production branch. [21:49:22] Operations, Proton, Readers-Web-Backlog, Readers-Web-Kanbanana-Board, Services (watching): Choose a deploy server for the chromium-render service - https://phabricator.wikimedia.org/T187821#3987454 (Pchelolo) [21:49:26] twentyafterfour: Could be a billion things causing that [21:49:41] (CR) jenkins-bot: group0 wikis to 1.31.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/413051 (owner: 20after4) [21:49:50] I'd see if it was just a temporary spike or something longer [21:49:53] (it's likely nothing) [21:53:30] no_justification: it started abruptly 5 hours ago [21:53:46] Hmmmm, to the SAL! [21:54:37] I don't see anything interesting [21:54:43] Nothing that would affect the parser, really [21:54:51] first event is at 2018-02-20T15:09:19 [21:54:53] twentyafterfour: All one wiki? [21:55:01] (could be some weird bot traffic?)
[21:55:43] no_justification: the wiki isn't logged because it's a oom fatal it doesn't include much info [21:56:44] (PS1) Ottomata: Remove kafkatee as a submodule; it will become a regular ops/puppet module [puppet] - https://gerrit.wikimedia.org/r/413054 [21:56:46] (PS1) Ottomata: Add kafkatee back in as a regular puppet module (not git submodule) [puppet] - https://gerrit.wikimedia.org/r/413055 [21:57:17] T187833 [21:57:17] T187833: Fatal error: request has exceeded memory limit in StripState.php - https://phabricator.wikimedia.org/T187833 [22:00:21] 2018-02-20T14:03:37Synchronized wmf-config/throttle.php: SWAT: [[gerrit:412606|throttle: add new rule for Wikidata edit-a-thon (T187655)]] [22:00:21] T187655: IP unblock requested for 20-30 new accounts being created at University of Edinburgh event. - https://phabricator.wikimedia.org/T187655 [22:00:49] maybe related? I don't see any relevant deployments that coincide with the time of the OOMs beginning [22:00:56] that patch was... mine [22:01:17] is the edit-a-thon happening now? [22:01:23] tomorrow [22:01:36] ok then that's not it [22:01:48] * twentyafterfour was grasping at straws with that one [22:02:09] 'from' => '2018-02-21T08:00 +0:00', [22:02:21] OOMs are kinda hard to debug in PHP :( [22:02:31] I think Krinkle has graphs for this? [22:02:32] <3 [22:02:34] (subtle ping) [22:02:38] (CR) Krinkle: "Please update PrivateSettings.php.example if this key must exist everywhere. This will ensure beta's ability to validate their PrivateSett" [mediawiki-config] - https://gerrit.wikimedia.org/r/412928 (https://phabricator.wikimedia.org/T169144) (owner: Gilles) [22:02:47] so it's not my patch causing that?
[22:03:49] no_justification: I do not have graphs that help debug memory problems in PHP (I think) [22:04:18] Having said that, I think XHProf does collect memory usage [22:04:20] :( [22:04:46] So if you can reproduce it with X-Wikimedia-Debug on testwiki or so in a way that doesn't fatal, you might be able to find where a lot of memory is spent. [22:05:32] E.g. on https://performance.wikimedia.org/xhgui/run/view?id=5a860e528eeeb28c27991e09 there is the "inclusive memory peak" [22:06:29] You could then make a similar request on a wiki with versionN and versionN-1 to compare, roughly [22:06:31] will be hard though [22:06:54] (PS1) Ottomata: Remove kafkatee as a submodule and re-add it into ops/puppet preserving history [puppet] - https://gerrit.wikimedia.org/r/413056 [22:07:09] (Abandoned) Ottomata: Add kafkatee back in as a regular puppet module (not git submodule) [puppet] - https://gerrit.wikimedia.org/r/413055 (owner: Ottomata) [22:07:13] (Abandoned) Ottomata: Remove kafkatee as a submodule; it will become a regular ops/puppet module [puppet] - https://gerrit.wikimedia.org/r/413054 (owner: Ottomata) [22:07:50] https://i.imgur.com/IaSHNXc.png [22:08:01] Hauskatze: it's not your patch [22:08:38] I don't actually know what URL will reproduce this OOM. I'm only seeing it in fatalmonitor, but it's happening a lot [22:11:39] (CR) Ottomata: "I guess this worked? https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/10059/console Should be no-op" [puppet] - https://gerrit.wikimedia.org/r/413056 (owner: Ottomata) [22:12:17] Hauskatze: cleanupTitles.php for enwikiversity would fix 10 pages (according to the dry run) [22:12:19] I can pastebin [22:12:43] no_justification: yep, so we can check [22:13:37] https://phabricator.wikimedia.org/P6723 [22:17:00] no_justification: so afaics it'll move those to NS 0 with Broken/ right? [22:17:20] One minute, looking at something else.
[22:17:53] \x3a <-- not sure what's that [22:18:03] (PS1) Rush: openstack: eth1.2120 for labtestneutron200[12] [puppet] - https://gerrit.wikimedia.org/r/413057 (https://phabricator.wikimedia.org/T184209) [22:18:04] the output seems to suggest that those were moved already? [22:18:25] It's in dry-run mode, it outputs what it would've done [22:19:28] I can try API-moving one of those [22:19:33] see if that fixes [22:20:06] !log T184209 create labs-instance-transport1-b-codfw [22:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:33] (CR) Rush: [C: 2] openstack: eth1.2120 for labtestneutron200[12] [puppet] - https://gerrit.wikimedia.org/r/413057 (https://phabricator.wikimedia.org/T184209) (owner: Rush) [22:21:34] no_justification: it works [22:21:36] (Move log); 22:21 . . MarcoAurelio (talk | contribs | block) moved page Talk:Draft:WikiJournal of Medicine/Rotavirus to Broken/Talk-Draft-WikiJournal of Medicine/Rotavirus without leaving a redirect (attempt to fix broken titles per phab:T187660) [22:21:36] T187660: en.wikiversity Draft Namespace Inaccessible Pages - https://phabricator.wikimedia.org/T187660 [22:25:57] !log demon@tin Synchronized php-1.31.0-wmf.21/extensions/Thanks/modules/ext.thanks.revthank.js: T187757 (duration: 01m 14s) [22:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:11] T187757: Thanks thanking the wrong edit - https://phabricator.wikimedia.org/T187757 [22:29:33] Operations, User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3987578 (Dzahn) I think it should live on something else, not krypton. More like a monitoring server or a dedicated VM. It was once added to krypton with a comment like "... [22:30:18] RoanKattouw: I went ahead and sync'd your backport to Thanks.
[22:30:30] It was a pretty disgusting regression [22:30:52] (inb4 Roan says "Thanks" for the Thanks sync) [22:30:53] Thanks [22:30:53] hehehehehe [22:30:59] Called it! [22:31:03] Hahaha you got me [22:31:15] Obvious joke was obvious :) [22:31:18] cc mooeypoo --^^ [22:31:21] Operations, User-Elukey: Ganeti instances to support Kafka Burrow Consumer lag monitoring - https://phabricator.wikimedia.org/T187805#3987587 (Dzahn) also, site.pp already looks like this, where everything is in a role except burrows being the oddball which should move: ``` node 'krypton.eqiad.wmnet' {... [22:32:42] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 17.77, 22.06, 23.79 [22:45:06] (CR) MaxSem: [C: 1] Use $wgDBname instead of IDatabase::getDBname in feed config [mediawiki-config] - https://gerrit.wikimedia.org/r/412837 (owner: Krinkle) [22:47:10] twentyafterfour: https://phabricator.wikimedia.org/rMW939faea318d9c2107fab3a584bc1c023f3c592e9 is probably related, but I can't imagine how [22:47:55] maybe someone is actually trying to do the sort of XSS attack that bawolff was trying to protect against? [22:48:13] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 21.36, 22.70, 23.89 [22:49:50] Operations, ops-eqiad, Cloud-VPS, cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#3987622 (chasemp) [22:51:02] (PS6) Chico Venancio: shinken: WMCS: use check_graphite_series sumSeries to reduce puppet failures false positves [puppet] - https://gerrit.wikimedia.org/r/411315 [22:51:33] Puppet, cloud-services-team (Kanban): Install hp health tools on labvirts where appropriate - https://phabricator.wikimedia.org/T187355#3987624 (bd808) [22:52:09] how do i get access to mwlog1001? file a access req and get my manager to approve it? [22:52:18] cscott: or a deliberate DOS thing [22:52:32] subbu: do you not have prod shell access?
[22:52:50] that is what you need to get to mwlog1001 (deployer rights) [22:53:00] i have access to tin & parsoid cluster .. but, ssh sastry@mwlog1001.eqiad.wmnet didn't let me through. [22:53:09] unless i am doing it wrong. [22:54:25] subbu: `id ssastry` shows your account there [22:54:48] you have to ssh to it using a jump server, but that should be the same as tin [22:55:34] bd808, never mind .. i typoed .. *facepalm* [22:55:42] :) easy enough to do [23:04:16] (CR) Rush: shinken: WMCS: use check_graphite_series sumSeries to reduce puppet failures false positves (1 comment) [puppet] - https://gerrit.wikimedia.org/r/411315 (owner: Chico Venancio) [23:07:15] (PS7) Chico Venancio: shinken: WMCS: use check_graphite_series sumSeries to reduce puppet failures false positves [puppet] - https://gerrit.wikimedia.org/r/411315 [23:08:09] (PS1) MarcoAurelio: Allow CheckUsers and Stewards to access private data from the AbuseLog [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) [23:09:02] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [23:09:46] (PS2) MarcoAurelio: Allow CheckUsers and Stewards to access private data from the AbuseLog [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) [23:10:14] (CR) Jalexander: [C: 1] "This is good from Legal and SuSa perspective, aiming to put up in this afternoon's SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio) [23:11:06] (PS3) MarcoAurelio: Allow CheckUsers and Stewards to access private data from the AbuseLog [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) [23:12:33] (PS8) Rush: shinken: WMCS: use check_graphite_series sumSeries to reduce puppet failures false positves [puppet] - https://gerrit.wikimedia.org/r/411315 (owner: Chico Venancio) [23:13:03] (CR) jerkins-bot: [V: -1] shinken: WMCS: use check_graphite_series sumSeries to reduce puppet failures false positves [puppet] - https://gerrit.wikimedia.org/r/411315 (owner: Chico Venancio) [23:22:25] (PS9) Chico Venancio: shinken: WMCS: use sumSeries to reduce puppet failures false positves [puppet] - https://gerrit.wikimedia.org/r/411315 (https://phabricator.wikimedia.org/T161898) [23:24:10] Operations, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3987662 (chasemp) Open>Resolved https://phabricator.wikimedia.org/T184209#3987660 [23:24:50] (PS10) Chico Venancio: shinken: WMCS: use sumSeries to reduce puppet failures false positves [puppet] - https://gerrit.wikimedia.org/r/411315 (https://phabricator.wikimedia.org/T161898) [23:25:02] (PS2) EddieGP: Show HTML summaries on cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396318 (https://phabricator.wikimedia.org/T182321) [23:29:30]
(CR) Dzahn: "please see inline comment, shouldn't this replace the existing rewrite line for this?" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: Framawiki) [23:30:04] (CR) Rush: [C: 1] "I will try to merge or tomorrow or someone who has a moment to babysit onto shinken-01 can :)" [puppet] - https://gerrit.wikimedia.org/r/411315 (https://phabricator.wikimedia.org/T161898) (owner: Chico Venancio) [23:30:06] (CR) Jalexander: [C: 1] Allow CheckUsers and Stewards to access private data from the AbuseLog [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio) [23:30:16] (CR) Dzahn: "ok, i see previous comments about it now.. not sure" [puppet] - https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: Framawiki) [23:30:33] (CR) jerkins-bot: [V: -1] shinken: WMCS: use sumSeries to reduce puppet failures false positves [puppet] - https://gerrit.wikimedia.org/r/411315 (https://phabricator.wikimedia.org/T161898) (owner: Chico Venancio) [23:30:43] (CR) MarcoAurelio: "In the name of all the things Sacred and Holly, please verify once, twice and every time its needed that NOBODY except checkusers everywhe" [mediawiki-config] - https://gerrit.wikimedia.org/r/413062 (https://phabricator.wikimedia.org/T160357) (owner: MarcoAurelio) [23:34:15] (PS11) Chico Venancio: shinken: WMCS: use sumSeries to reduce puppet failures false positves [puppet] - https://gerrit.wikimedia.org/r/411315 (https://phabricator.wikimedia.org/T161898) [23:39:02] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:42:36] (CR) Dzahn: [C: 2] "i tested it with mwdebug1001 and apache-fast-test on tin.
i confirmed it didn't affect other things but changed the techblog.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: Framawiki) [23:43:05] (PS6) Dzahn: Redirect techblog.wikimedia.org to blog.wikimedia.org/c/technology [puppet] - https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: Framawiki) [23:44:06] jouncebot: next [23:44:06] In 0 hour(s) and 15 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180221T0000) [23:50:02] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 20.92, 22.37, 23.83 [23:52:53] (CR) Dzahn: "i believe it _might_ have been the point of the check to detect them especially on bastion host to discourage using bastion hosts for this" [puppet] - https://gerrit.wikimedia.org/r/412674 (owner: Muehlenhoff) [23:53:57] (CR) Dzahn: [C: 2] "comments only" [puppet] - https://gerrit.wikimedia.org/r/411547 (owner: Krinkle) [23:54:03] (PS2) Dzahn: webperf: Add some commments to navtiming test cases [puppet] - https://gerrit.wikimedia.org/r/411547 (owner: Krinkle)