[00:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180116T0000). [00:00:04] odder: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:30:48] (03PS46) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [00:36:28] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3901687 (10cy534) @RobH: have you heard anything about the NDA? I received an email from @RStallman-legalteam a few days ago saying that my acknowledgement of N... [00:44:44] 10Operations, 10DNS, 10Domains, 10Traffic, and 2 others: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3316252 (10BBlack) There's a lot else to be said about the subject of the `.wiki` TLD (much of which has been said before on past tickets), and I tend to... 
[02:24:18] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.15) (duration: 06m 10s) [02:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:19] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 799.27 seconds [03:59:29] PROBLEM - Nginx local proxy to apache on mw2132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:00:19] RECOVERY - Nginx local proxy to apache on mw2132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.197 second response time [04:02:19] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 237.74 seconds [04:04:10] 10Operations, 10Datasets-General-or-Unknown, 10Wikidata, 10HHVM, 10Patch-For-Review: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245#3901918 (10hoo) 05Open>03Resolved a:03hoo >>! In T162245#3901016, @ArielGlenn wrote: > Snapshot hosts are going directl... [04:13:19] PROBLEM - Nginx local proxy to apache on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.011 second response time [04:13:20] PROBLEM - Apache HTTP on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [04:13:39] PROBLEM - HHVM rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [04:14:19] RECOVERY - Nginx local proxy to apache on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.035 second response time [04:14:20] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.039 second response time [04:14:39] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 74733 bytes in 1.514 second response time [05:46:00] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:58:20] PROBLEM - Check Varnish expiry mailbox lag on cp4024 is CRITICAL: CRITICAL: expiry mailbox lag is 2047803 [06:12:54] (03PS1) 10Marostegui: Revert "db-eqiad.php: Replace db1063 with db1087 as vslow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404404 [06:12:57] (03PS2) 10Marostegui: Revert "db-eqiad.php: Replace db1063 with db1087 as vslow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404404 [06:15:51] (03Abandoned) 10Marostegui: Revert "db-eqiad.php: Replace db1063 with db1087 as vslow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404404 (owner: 10Marostegui) [06:16:00] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [06:18:02] (03PS1) 10Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404405 (https://phabricator.wikimedia.org/T174569) [06:20:18] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404405 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:21:54] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404405 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:22:04] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404405 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:23:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1092 - T174569 (duration: 01m 32s) [06:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:03] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:24:07] !log Upgrade kernel on db1092 [06:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:02] !log Deploy schema 
change on db1092 - T174569 [06:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:13] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:32:19] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1952 bytes in 0.101 second response time [06:37:19] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1926 bytes in 0.095 second response time [06:42:51] (03PS1) 10Marostegui: s8.hosts: Add dbstore2001 [software] - 10https://gerrit.wikimedia.org/r/404406 (https://phabricator.wikimedia.org/T177208) [06:44:17] (03CR) 10Marostegui: [C: 032] s8.hosts: Add dbstore2001 [software] - 10https://gerrit.wikimedia.org/r/404406 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [06:45:13] (03Merged) 10jenkins-bot: s8.hosts: Add dbstore2001 [software] - 10https://gerrit.wikimedia.org/r/404406 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [06:48:32] (03CR) 10Marostegui: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/9738/" [puppet] - 10https://gerrit.wikimedia.org/r/404323 (https://phabricator.wikimedia.org/T184832) (owner: 10Jcrespo) [06:54:43] (03PS1) 10Marostegui: s8.hosts: Add a dbstore and labs servers [software] - 10https://gerrit.wikimedia.org/r/404407 (https://phabricator.wikimedia.org/T174569) [06:58:24] (03CR) 10Marostegui: [C: 032] s8.hosts: Add a dbstore and labs servers [software] - 10https://gerrit.wikimedia.org/r/404407 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [06:59:20] (03Merged) 10jenkins-bot: s8.hosts: Add a dbstore and labs servers [software] - 10https://gerrit.wikimedia.org/r/404407 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [07:00:08] (03PS1) 10Marostegui: s8.hosts: Add sanitarium to s8 [software] - 10https://gerrit.wikimedia.org/r/404408 
[07:03:19] (03CR) 10Marostegui: [C: 032] s8.hosts: Add sanitarium to s8 [software] - 10https://gerrit.wikimedia.org/r/404408 (owner: 10Marostegui) [07:03:32] !log Deploy schema change on dbstore1002 (s8) - T174569 [07:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:44] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [07:04:45] (03Merged) 10jenkins-bot: s8.hosts: Add sanitarium to s8 [software] - 10https://gerrit.wikimedia.org/r/404408 (owner: 10Marostegui) [07:07:58] (03PS1) 10Jcrespo: mariadb: Depool db1055 and db1056 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404409 (https://phabricator.wikimedia.org/T183469) [07:08:19] (03PS2) 10Jcrespo: mariadb: Depool db1055 and db1056 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404409 (https://phabricator.wikimedia.org/T183469) [07:09:40] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1055 and db1056 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404409 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [07:10:44] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1055 and db1056 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404409 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [07:12:08] (03Merged) 10jenkins-bot: mariadb: Depool db1055 and db1056 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404409 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [07:12:21] (03CR) 10jenkins-bot: mariadb: Depool db1055 and db1056 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404409 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [07:14:56] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 and db1056 for maintenance (duration: 01m 12s) [07:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:07] !log 
upgrade and reboot db1055 [07:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:27] (03PS1) 10Marostegui: db-eqiad.php: Depool db1066 and db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404410 (https://phabricator.wikimedia.org/T162807) [07:26:09] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3902005 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['mw1339.eqiad.wmnet', 'mw1341.eqiad.wmn... [07:26:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1066 and db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404410 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [07:28:06] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1066 and db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404410 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [07:28:18] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1066 and db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404410 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [07:29:03] !log oblivian@neodymium conftool action : set/weight=25; selector: name=mw1340.eqiad.wmnet [07:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1066 and db1089 - T162807 (duration: 01m 13s) [07:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:53] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [07:30:00] !log upgrade and reboot db1056 [07:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:43] !log Stop replication in sync db1066 and db1089 - T162807 [07:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[07:34:30] !log Deploy schema change on dbstore1001 (s8) - T174569 [07:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:42] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [07:42:06] !log moving replication topology of x1 replicas [07:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:15] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1055 and db1056 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404412 [07:55:20] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1055 and db1056 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404412 [07:58:54] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1055 and db1056 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404412 (owner: 10Jcrespo) [08:00:27] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1055 and db1056 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404412 (owner: 10Jcrespo) [08:00:37] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1055 and db1056 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404412 (owner: 10Jcrespo) [08:02:03] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 and db1056 after maintenance (duration: 00m 49s) [08:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:36] (03PS1) 10Muehlenhoff: Remove access credentials for dpatrick [puppet] - 10https://gerrit.wikimedia.org/r/404413 [08:06:35] (03PS5) 10Jcrespo: mariadb: Promote db1055 to be the x1 eqiad master instead of db1031 [puppet] - 10https://gerrit.wikimedia.org/r/403678 (https://phabricator.wikimedia.org/T183469) [08:08:33] (03PS2) 10Muehlenhoff: Remove access credentials for dpatrick [puppet] - 10https://gerrit.wikimedia.org/r/404413 [08:09:19] (03PS3) 10Jcrespo: mariadb: Promote db1055 to be the x1 eqiad master instead of db1031 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/403454 (https://phabricator.wikimedia.org/T183469) [08:11:43] !log start x1 eqiad master failover [08:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:02] I want exclusivity on puppet and mediawiki deployment for 5 minutes [08:12:48] (03CR) 10Jcrespo: [C: 032] mariadb: Promote db1055 to be the x1 eqiad master instead of db1031 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403454 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [08:13:09] sure, I can wait to merge my credentials change; just write here when you're done [08:13:10] (03CR) 10jenkins-bot: mariadb: Promote db1055 to be the x1 eqiad master instead of db1031 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403454 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [08:13:13] (03CR) 10Jcrespo: [C: 032] dblist: Promote db1055 to be the x1 eqiad master instead of db1031 [software] - 10https://gerrit.wikimedia.org/r/403679 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [08:13:21] thanks moritzm - will do! 
[08:13:50] I will log it, this should be only 5 minutes [08:14:01] and this is just in case something goes horribly wrong [08:14:39] (03Merged) 10jenkins-bot: dblist: Promote db1055 to be the x1 eqiad master instead of db1031 [software] - 10https://gerrit.wikimedia.org/r/403679 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [08:15:10] (03CR) 10Jcrespo: [C: 032] mariadb: Promote db1055 to be the x1 eqiad master instead of db1031 [puppet] - 10https://gerrit.wikimedia.org/r/403678 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [08:16:59] starting with the quick steps now [08:17:07] +1 [08:17:23] !log setting db1031 (x1 master) as read only [08:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:40] I can see it is read only now [08:17:57] db1055 caught up in sync [08:18:21] I can see db1031 killed on db1031 \o/ [08:19:00] errors coming up (expected) [08:19:01] deploying change now [08:19:33] we should be done now [08:19:38] errors? 
[08:19:46] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Promote db1055 as the new x1 master (duration: 00m 49s) [08:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:03] not happening anymore [08:20:09] there were not many errors [08:20:13] but there are no more coming up [08:20:36] tendril seems normal [08:21:00] I will reset slave all on db1055 after copying the coords [08:21:04] yeah, apart from db1055 with replication stopped (of course, as it needs to be reset) all seems good [08:21:07] yeah [08:21:42] I see writes on db1055 fine [08:23:41] I will enable puppet on db1031 and put it as a replica [08:23:50] cool [08:24:40] I will first double-check the coords on the new master [08:25:23] we had 36 connection errors, and I think all from the jobqueue [08:25:55] we had more replication errors, of course == read only [08:25:56] Yeah, there were very very few errors [08:26:41] 1m30s or so of read only [08:26:52] PROBLEM - MariaDB Slave SQL: x1 on db1031 is CRITICAL: CRITICAL slave_sql_state could not connect [08:26:54] mostly due to mediawiki deploy [08:27:22] PROBLEM - MariaDB Slave IO: x1 on db1031 is CRITICAL: CRITICAL slave_io_state could not connect [08:27:35] downtime has expired [08:28:31] !log master x1 eqiad failover has finished [08:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:13] (03PS3) 10Muehlenhoff: Remove access credentials for dpatrick [puppet] - 10https://gerrit.wikimedia.org/r/404413 [08:31:20] (03PS1) 10Marostegui: db-eqiad.php: Repool db1066, depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404415 (https://phabricator.wikimedia.org/T162807) [08:31:50] PROBLEM - Check whether ferm is active by checking the default input chain on mw1339 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:31:50] PROBLEM - configured eth on mw1339 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
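Editor's note: the order of operations discussed in the x1 failover above (set the old master read only, wait for the candidate to catch up, deploy the config change, then clear replication on the new master and re-attach the old one as a replica) can be sketched roughly as follows. This is only an in-memory illustration: the `Server` class, the `failover` function, and the numeric positions are hypothetical stand-ins, not the tooling or commands actually used in this log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    """Hypothetical stand-in for a database host; not a real client API."""
    name: str
    read_only: bool = False
    position: int = 0                      # illustrative replication position
    replicating_from: Optional[str] = None


def failover(old: Server, new: Server, config: dict) -> list:
    """Promote `new` in place of `old` and return the steps taken, in order."""
    steps = []
    # 1. Stop writes on the old master so its position stops advancing.
    old.read_only = True
    steps.append(f"set {old.name} read only")
    # 2. The candidate must have applied everything the old master wrote.
    #    (A real procedure would poll until the positions match.)
    new.position = old.position
    steps.append(f"{new.name} caught up")
    # 3. Point the application at the new master (the config deploy).
    config["master"] = new.name
    steps.append(f"config now points at {new.name}")
    # 4. The new master stops replicating and accepts writes;
    #    the old master is re-attached as its replica.
    new.replicating_from = None
    new.read_only = False
    old.replicating_from = new.name
    steps.append(f"{old.name} replicates from {new.name}")
    return steps


old = Server("db1031", position=1200)
new = Server("db1055", read_only=True, position=1180, replicating_from="db1031")
config = {"master": "db1031"}
for step in failover(old, new, config):
    print(step)
```

The point of the ordering is that writes stop before the switch, so the read-only window (about 1m30s in the conversation above) is bounded mostly by how long the config deploy takes.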
[08:31:51] PROBLEM - MD RAID on mw1342 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:31:51] PROBLEM - Check size of conntrack table on mw1342 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:31:53] (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for dpatrick [puppet] - 10https://gerrit.wikimedia.org/r/404413 (owner: 10Muehlenhoff) [08:32:00] PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:33:31] PROBLEM - Nginx local proxy to apache on mw1342 is CRITICAL: connect to address 10.64.32.54 and port 443: Connection refused [08:33:31] PROBLEM - Check systemd state on mw1342 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:33:39] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1066, depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404415 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [08:33:48] moritzm: if it is not 100% clear, we are finished, regarding the cannot stop parts of the x1 failover [08:33:50] RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [08:34:13] jynus: thanks, I saw the log and already went ahead :-) [08:34:30] thanks, I just wanted to ping you as requested [08:35:09] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1066, depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404415 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [08:35:11] PROBLEM - Nginx local proxy to apache on mw1341 is CRITICAL: connect to address 10.64.32.53 and port 443: Connection refused [08:35:11] PROBLEM - mediawiki-installation DSH group on mw1339 is CRITICAL: Host mw1339 is not in mediawiki-installation dsh group [08:36:11] RECOVERY - Nginx local proxy to apache on mw1341 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.010 second response time [08:36:40] !log marostegui@tin Synchronized 
wmf-config/db-eqiad.php: Repool db1066, depool db1099:3311 - T162807 (duration: 01m 12s) [08:36:42] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1066, depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404415 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [08:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:53] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [08:38:01] PROBLEM - HHVM processes on mw1339 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:38:01] PROBLEM - nutcracker port on mw1339 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:38:03] !log Stop replication in sync db1089 - db1099:3311 - T162807 [08:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:51] RECOVERY - Check whether ferm is active by checking the default input chain on mw1339 is OK: OK ferm input default policy is set [08:38:51] RECOVERY - configured eth on mw1339 is OK: OK - interfaces up [08:39:00] RECOVERY - nutcracker port on mw1339 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [08:39:00] RECOVERY - HHVM processes on mw1339 is OK: PROCS OK: 6 processes with command name hhvm [08:40:21] PROBLEM - mediawiki-installation DSH group on mw1342 is CRITICAL: Host mw1342 is not in mediawiki-installation dsh group [08:40:21] PROBLEM - Disk space on mw1342 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[08:40:40] RECOVERY - Check systemd state on mw1342 is OK: OK - running: The system is fully operational [08:40:51] RECOVERY - Check size of conntrack table on mw1342 is OK: OK: nf_conntrack is 0 % full [08:40:51] RECOVERY - MD RAID on mw1342 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [08:41:21] RECOVERY - Disk space on mw1342 is OK: DISK OK [08:43:51] PROBLEM - HHVM rendering on mw1342 is CRITICAL: connect to address 10.64.32.54 and port 80: Connection refused [08:45:41] !log rebooting sarin for kernel security update [08:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:47] !log rearmed key holder on sarin [08:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:21] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3902078 (10Gilles) It turns out to be quite common for load.php calls to take more than a minute: https://logstash.wikimedia.org/goto/7... [08:57:16] (03PS3) 10Ema: vcl: remove X-CP-Full-Cipher [puppet] - 10https://gerrit.wikimedia.org/r/398314 [08:57:26] BTW, this is our first master on stretch/10.1 [08:59:39] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3902092 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1341.eqiad.wmnet', 'mw1342.eqiad.wmnet', 'mw1339.eqiad.wmnet'] ``` and were **ALL** suc... 
[08:59:54] (03CR) 10Jcrespo: [C: 032] mariadb: Point x1-master.eqiad.wmnet to db1055 [dns] - 10https://gerrit.wikimedia.org/r/401712 (https://phabricator.wikimedia.org/T184054) (owner: 10Jcrespo) [08:59:57] (03PS2) 10Jcrespo: mariadb: Point x1-master.eqiad.wmnet to db1055 [dns] - 10https://gerrit.wikimedia.org/r/401712 (https://phabricator.wikimedia.org/T184054) [09:01:05] (03CR) 10Ema: [C: 032] vcl: remove X-CP-Full-Cipher [puppet] - 10https://gerrit.wikimedia.org/r/398314 (owner: 10Ema) [09:08:07] !log upgrade and reboot db1031 after switchover [09:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:15] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1031.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1031.eqiad.wmnet (111 Connection refused) [09:12:08] !log installing krb5 security updates (we're just using rev deps) [09:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:05] PROBLEM - HHVM rendering on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:13:55] RECOVERY - HHVM rendering on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 74731 bytes in 0.147 second response time [09:14:15] the dbstore is me because x1 changes, ignore it [09:14:23] it will come back in a second [09:15:47] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/404325 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [09:16:01] (03CR) 10jerkins-bot: [V: 04-1] restbase: reprovision restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/404325 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [09:16:17] !log installing libxml2 security updates on mw* servers (so that it gets picked up along the HHVM 3.18.7 rollout) [09:16:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:38] (03PS2) 10Filippo Giunchedi: restbase: reprovision restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/404325 (https://phabricator.wikimedia.org/T184100) [09:19:21] (03PS1) 10Marostegui: db-eqiad.php: Repool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404423 (https://phabricator.wikimedia.org/T162807) [09:19:46] RECOVERY - Nginx local proxy to apache on mw1342 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.972 second response time [09:20:05] RECOVERY - HHVM rendering on mw1342 is OK: HTTP OK: HTTP/1.1 200 OK - 74719 bytes in 7.006 second response time [09:20:51] (03CR) 10Mobrovac: [C: 031] restbase: reprovision restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/404325 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [09:21:15] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Preparing, (no error: intentional) [09:21:32] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404423 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:22:57] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404423 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:23:07] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404423 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:23:11] (03CR) 10Filippo Giunchedi: [C: 032] restbase: reprovision restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/404325 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [09:24:10] !log oblivian@neodymium conftool action : set/pooled=yes:weight=1; selector: cluster=api_appserver,name=mw13(39|4[12]).eqiad.wmnet [09:24:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1099:3311 - T162807 (duration: 01m 12s) [09:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:39] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [09:25:21] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3902141 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['mw1343.eqiad.wmnet', 'mw1344.eqiad.wmn... [09:30:03] !log oblivian@neodymium conftool action : set/weight=10; selector: cluster=api_appserver,name=mw13(39|4[12]).eqiad.wmnet [09:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:15] RECOVERY - mediawiki-installation DSH group on mw1339 is OK: OK [09:37:39] (03PS1) 10Marostegui: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404425 (https://phabricator.wikimedia.org/T162807) [09:38:08] (03PS1) 10Ema: Revert "vcl: remove X-CP-Full-Cipher" [puppet] - 10https://gerrit.wikimedia.org/r/404426 [09:38:22] <_joe_> !log started refreshLinks additional jobs for commonswiki,ruwiki [09:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:55] (03CR) 10Ema: [C: 032] Revert "vcl: remove X-CP-Full-Cipher" [puppet] - 10https://gerrit.wikimedia.org/r/404426 (owner: 10Ema) [09:39:00] (03PS2) 10Ema: Revert "vcl: remove X-CP-Full-Cipher" [puppet] - 10https://gerrit.wikimedia.org/r/404426 [09:39:03] (03CR) 10Ema: [V: 032 C: 032] Revert "vcl: remove X-CP-Full-Cipher" [puppet] - 10https://gerrit.wikimedia.org/r/404426 (owner: 10Ema) [09:40:04] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404425 
(https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:40:25] RECOVERY - mediawiki-installation DSH group on mw1342 is OK: OK [09:41:16] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error [09:41:39] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404425 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:41:52] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404425 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:41:55] ^joe expected? [09:42:16] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [09:42:37] (03PS2) 10Filippo Giunchedi: prometheus: allow override of fs-related node-exporter options [puppet] - 10https://gerrit.wikimedia.org/r/404324 (https://phabricator.wikimedia.org/T184469) [09:42:51] (03CR) 10jerkins-bot: [V: 04-1] prometheus: allow override of fs-related node-exporter options [puppet] - 10https://gerrit.wikimedia.org/r/404324 (https://phabricator.wikimedia.org/T184469) (owner: 10Filippo Giunchedi) [09:43:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1105:3311 - T162807 (duration: 01m 12s) [09:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:23] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [09:44:27] jynus: that's a bug in pybal https://phabricator.wikimedia.org/T184721 [09:45:09] so bad, but expected, and not causing issues? 
[09:45:21] correct [09:50:50] !log Stop replication in sync db1089 - db1105:3311 - T162807 [09:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:02] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [09:51:44] _joe_, elukey: FYI, I'll piggyback the nginx update on the HHVM 3.18.7 rollout (https://phabricator.wikimedia.org/T164456#3744448) [09:52:15] ack! [09:52:41] moritzm: big version jump or relatively easy? [09:54:30] 1.11.10->1.13.6 [09:54:58] but once we're on nginx-light it'll be simpler for us [09:55:19] super [09:56:42] !log upgrading canary app servers to HHVM 3.18.7 [09:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:53] 10Operations, 10monitoring, 10Patch-For-Review, 10User-Elukey: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#3902180 (10elukey) [09:58:02] 10Operations, 10monitoring, 10Patch-For-Review, 10User-Elukey: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#3896530 (10elukey) p:05Triage>03Normal [09:58:34] (03PS1) 10Elukey: profile::puppetmaster::puppetdb: add jmx metrics to export [puppet] - 10https://gerrit.wikimedia.org/r/404427 (https://phabricator.wikimedia.org/T184796) [10:00:12] (03CR) 10Alexandros Kosiaris: [C: 031] profile::puppetmaster::puppetdb: add jmx metrics to export [puppet] - 10https://gerrit.wikimedia.org/r/404427 (https://phabricator.wikimedia.org/T184796) (owner: 10Elukey) [10:00:31] (03CR) 10Giuseppe Lavagetto: [C: 031] profile::puppetmaster::puppetdb: add jmx metrics to export [puppet] - 10https://gerrit.wikimedia.org/r/404427 (https://phabricator.wikimedia.org/T184796) (owner: 10Elukey) [10:01:02] (03CR) 10Elukey: "Tested on af-puppetdb01, list of metrics: https://phabricator.wikimedia.org/P6590" [puppet] - 10https://gerrit.wikimedia.org/r/404427 (https://phabricator.wikimedia.org/T184796) 
(owner: 10Elukey) [10:02:14] (03PS3) 10Filippo Giunchedi: prometheus: allow override of fs-related node-exporter options [puppet] - 10https://gerrit.wikimedia.org/r/404324 (https://phabricator.wikimedia.org/T184469) [10:03:05] (03CR) 10Filippo Giunchedi: [C: 031] profile::puppetmaster::puppetdb: add jmx metrics to export [puppet] - 10https://gerrit.wikimedia.org/r/404427 (https://phabricator.wikimedia.org/T184796) (owner: 10Elukey) [10:04:14] 10Operations, 10Dumps-Generation: Reboot snapshot*, dumpsdata*, dataset1001, ms1001, francium - https://phabricator.wikimedia.org/T184443#3902205 (10ArielGlenn) Done for: - snapshot1001,5,6 - dataset1001, ms1001 - dumpsdata1002 - francium Waiting for wikidata weeklies to complete before doing: snapshot100... [10:04:52] (03CR) 10Elukey: [C: 032] profile::puppetmaster::puppetdb: add jmx metrics to export [puppet] - 10https://gerrit.wikimedia.org/r/404427 (https://phabricator.wikimedia.org/T184796) (owner: 10Elukey) [10:08:56] PROBLEM - DPKG on mw1262 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:09:07] PROBLEM - DPKG on mw1264 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:09:39] 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3902211 (10akosiaris) The only reason I can think of is people still navigating to `grafana-admin` and using since it will still DTRT. As... 
[10:09:56] RECOVERY - DPKG on mw1262 is OK: All packages OK [10:10:07] RECOVERY - DPKG on mw1264 is OK: All packages OK [10:13:06] !log start cassandra-a on restbase1018 - T184100 [10:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:18] T184100: Reprovision legacy Cassandra nodes into new cluster - https://phabricator.wikimedia.org/T184100 [10:15:11] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3902227 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['mw1346.eqiad.wmnet', 'mw1347.eqiad.wmn... [10:22:10] 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3902229 (10akosiaris) Talk on IRC suggests `engineering@`. It has 202 subscribers so it's probably a better candidate than `ops@` [10:22:34] (03PS4) 10Filippo Giunchedi: prometheus: allow override of fs-related node-exporter options [puppet] - 10https://gerrit.wikimedia.org/r/404324 (https://phabricator.wikimedia.org/T184469) [10:22:36] (03PS1) 10Filippo Giunchedi: prometheus: tweak node_exporter ignored_devices [puppet] - 10https://gerrit.wikimedia.org/r/404430 [10:27:13] (03PS1) 10Giuseppe Lavagetto: mediawiki::scap: fetch mediawiki after mwdeploy has its sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/404431 [10:27:26] <_joe_> elukey, volans ^^ [10:28:27] <_joe_> if we merge it soon, we'll have an immediate feedback from the latest installations [10:29:21] _joe_: looking in couple of minutes [10:31:33] PROBLEM - mediawiki-installation DSH group on mw1345 is CRITICAL: Host mw1345 is not in mediawiki-installation dsh group [10:33:13] PROBLEM - mediawiki-installation DSH group on mw1344 is CRITICAL: Host mw1344 is not in mediawiki-installation dsh group [10:33:47] <_joe_> those are more or 
less expected ^^, I'll pool those servers soon enough(TM) [10:34:43] PROBLEM - DPKG on mw1344 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:34:43] PROBLEM - dhclient process on mw1344 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:34:53] PROBLEM - mediawiki-installation DSH group on mw1343 is CRITICAL: Host mw1343 is not in mediawiki-installation dsh group [10:35:29] _joe_: I don't see rsync mentioned in Sudo::User['medeploy'], am I looking in the wrong place? [10:35:33] PROBLEM - Check whether ferm is active by checking the default input chain on mw1343 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:35:34] PROBLEM - configured eth on mw1343 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:35:34] RECOVERY - dhclient process on mw1344 is OK: PROCS OK: 0 processes with command name dhclient [10:35:34] RECOVERY - DPKG on mw1344 is OK: All packages OK [10:36:14] <_joe_> volans: rsync is ran by scap pull [10:36:38] RECOVERY - Check whether ferm is active by checking the default input chain on mw1343 is OK: OK ferm input default policy is set [10:36:38] RECOVERY - configured eth on mw1343 is OK: OK - interfaces up [10:37:45] _joe_: right, but neither scap is there.. [10:37:56] I'm looking at modules/mediawiki/manifests/users.pp [10:38:40] <_joe_> volans: uhm that is right, I misread the first line probably [10:39:26] <_joe_> no, I was right [10:39:39] <_joe_> the sudo done by scap is actually [10:39:47] <_joe_> sudo -u mwdeploy IIRC [10:39:49] <_joe_> lemme recheck [10:40:15] <_joe_> so yeah, we're doing sudo -u mwdeploy from the user mwdeploy [10:40:34] :( [10:40:55] <_joe_> pull failed: Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', ... 
[10:41:07] ok then [10:41:15] <_joe_> volans: there are reasons for that, but I'd rather merge the fix and confirm it works than explain them now :P [10:41:41] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/404431 (owner: 10Giuseppe Lavagetto) [10:42:14] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::scap: fetch mediawiki after mwdeploy has its sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/404431 (owner: 10Giuseppe Lavagetto) [10:48:18] (03PS2) 10Filippo Giunchedi: Whitelist X-MediaWiki-Patrol-Status header in Swift [puppet] - 10https://gerrit.wikimedia.org/r/402471 (https://phabricator.wikimedia.org/T167400) (owner: 10Gergő Tisza) [10:49:05] (03CR) 10Filippo Giunchedi: [C: 032] Whitelist X-MediaWiki-Patrol-Status header in Swift [puppet] - 10https://gerrit.wikimedia.org/r/402471 (https://phabricator.wikimedia.org/T167400) (owner: 10Gergő Tisza) [10:57:35] !log reboot nescio for kernel security update [10:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:48] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3902291 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1345.eqiad.wmnet', 'mw1344.eqiad.wmnet', 'mw1343.eqiad.wmnet'] ``` and were **ALL** suc... [10:59:44] !log roll-restart swift object server - T167400 [10:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:57] T167400: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400 [11:00:18] PROBLEM - High CPU load on API appserver on mw1346 is CRITICAL: Return code of 255 is out of bounds [11:02:18] PROBLEM - puppet last run on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:03:39] PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - free space: /dev/shm 118 MB (0% inode=99%): /srv 437898 MB (6% inode=93%) [11:03:58] PROBLEM - Apache HTTP on mw1348 is CRITICAL: connect to address 10.64.32.60 and port 80: Connection refused [11:05:39] PROBLEM - Apache HTTP on mw1346 is CRITICAL: connect to address 10.64.32.58 and port 80: Connection refused [11:05:39] PROBLEM - Check size of conntrack table on mw1348 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:05:39] PROBLEM - MD RAID on mw1348 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:07:28] PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: connect to address 10.64.32.60 and port 443: Connection refused [11:07:28] PROBLEM - Check systemd state on mw1348 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:07:28] PROBLEM - MD RAID on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:07:28] PROBLEM - Check size of conntrack table on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:08:08] !log uploaded HHVM 3.18.7 for stretch-wikimedia to apt.wikimedia.org [11:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:08] PROBLEM - Nginx local proxy to apache on mw1346 is CRITICAL: connect to address 10.64.32.58 and port 443: Connection refused [11:09:08] PROBLEM - Check systemd state on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:09:08] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1348 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:10:48] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:10:48] PROBLEM - Check whether ferm is active by checking the default input chain on mw1348 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:10:49] PROBLEM - configured eth on mw1348 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:12:18] !log reboot maerlant for kernel security update [11:12:28] PROBLEM - Check whether ferm is active by checking the default input chain on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:12:28] PROBLEM - configured eth on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:12:28] PROBLEM - DPKG on mw1348 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:12:29] PROBLEM - dhclient process on mw1348 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:58] RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [11:14:18] PROBLEM - dhclient process on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:14:18] PROBLEM - DPKG on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:14:18] PROBLEM - mediawiki-installation DSH group on mw1348 is CRITICAL: Host mw1348 is not in mediawiki-installation dsh group [11:14:18] PROBLEM - Disk space on mw1348 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:15:58] PROBLEM - mediawiki-installation DSH group on mw1346 is CRITICAL: Host mw1346 is not in mediawiki-installation dsh group [11:15:58] PROBLEM - Disk space on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:17:09] PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:17:38] PROBLEM - HHVM processes on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:17:38] PROBLEM - nutcracker port on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:17:48] PROBLEM - HHVM rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:19:19] PROBLEM - HHVM rendering on mw1346 is CRITICAL: connect to address 10.64.32.58 and port 80: Connection refused [11:19:28] PROBLEM - nutcracker process on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:21:38] RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [11:21:58] PROBLEM - IPMI Sensor Status on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:23:20] 10Operations, 10Developer-Relations, 10Discourse: Enable GitHub login in discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T184986#3902342 (10Qgil) p:05Triage>03Normal [11:23:38] PROBLEM - HHVM rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:25:39] PROBLEM - dhclient process on mw1348 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:25:39] PROBLEM - DPKG on mw1348 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:25:48] PROBLEM - nutcracker process on mw1348 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:25:49] PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:25:58] RECOVERY - Check size of conntrack table on mw1348 is OK: OK: nf_conntrack is 0 % full [11:25:58] RECOVERY - MD RAID on mw1348 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [11:25:58] RECOVERY - Check whether ferm is active by checking the default input chain on mw1348 is OK: OK ferm input default policy is set [11:25:58] RECOVERY - configured eth on mw1348 is OK: OK - interfaces up [11:26:28] RECOVERY - Disk space on mw1348 is OK: DISK OK [11:26:29] RECOVERY - Check systemd state on mw1348 is OK: OK - running: The system is fully operational [11:26:29] RECOVERY - dhclient process on mw1348 is OK: PROCS OK: 0 processes with command name dhclient [11:26:29] RECOVERY - DPKG on mw1348 is OK: All packages OK [11:26:48] RECOVERY - nutcracker process on mw1348 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [11:26:49] PROBLEM - HHVM rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:27:09] RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.064 second response time [11:27:33] !log reboot kafka1001 for kernel upgrades [11:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:32] (03PS1) 10Filippo Giunchedi: prometheus: bump global retention to 15 months [puppet] - 10https://gerrit.wikimedia.org/r/404434 [11:28:34] RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.065 second response time [11:29:05] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 29 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[parted-/dev/sdn] [11:30:15] 10Operations, 10Developer-Relations, 10Discourse: Enable Wikimedia Phabricator login in discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T184987#3902356 (10Qgil) p:05Triage>03Normal [11:31:49] (03PS2) 10Filippo Giunchedi: prometheus: bump global retention to 15 months [puppet] - 10https://gerrit.wikimedia.org/r/404434 (https://phabricator.wikimedia.org/T160677) [11:32:34] RECOVERY - HHVM rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 74220 bytes in 7.429 second response time [11:33:27] (03PS3) 10Filippo Giunchedi: prometheus: bump global retention to 15 months [puppet] - 10https://gerrit.wikimedia.org/r/404434 (https://phabricator.wikimedia.org/T160677) [11:33:54] PROBLEM - nutcracker port on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:33:54] PROBLEM - HHVM processes on mw1346 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:33:54] RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 7.271 second response time [11:34:05] RECOVERY - Disk space on mw1346 is OK: DISK OK [11:34:24] RECOVERY - Check systemd state on mw1346 is OK: OK - running: The system is fully operational [11:34:25] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3902372 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` mw1347.eqiad.wmnet ``` The log can be fo... 
[11:34:31] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3902373 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1347.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1347.eqiad.wmnet'] ``` [11:34:34] RECOVERY - dhclient process on mw1346 is OK: PROCS OK: 0 processes with command name dhclient [11:34:34] RECOVERY - High CPU load on API appserver on mw1346 is OK: OK - load average: 17.11, 11.68, 6.00 [11:34:34] RECOVERY - DPKG on mw1346 is OK: All packages OK [11:34:44] RECOVERY - Check size of conntrack table on mw1346 is OK: OK: nf_conntrack is 0 % full [11:34:44] RECOVERY - nutcracker process on mw1346 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [11:34:44] RECOVERY - MD RAID on mw1346 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [11:34:45] RECOVERY - nutcracker port on mw1346 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [11:34:45] RECOVERY - HHVM processes on mw1346 is OK: PROCS OK: 6 processes with command name hhvm [11:36:45] PROBLEM - Apache HTTP on mw1346 is CRITICAL: connect to address 10.64.32.58 and port 80: Connection refused [11:39:07] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1348 is OK: OK: synced at Tue 2018-01-16 11:39:01 UTC. 
[11:40:11] 10Operations, 10Performance-Team, 10Thumbor, 10MW-1.31-release-notes (WMF-deploy-2017-09-26 (1.31.0-wmf.1)), 10User-fgiunchedi: Remove X-Content-Dimensions for multipage originals - https://phabricator.wikimedia.org/T175689#3902378 (10fgiunchedi) [11:40:13] 10Operations, 10Thumbor, 10Performance-Team (Radar), 10User-fgiunchedi: Find and clear oversized x-content-dimensions headers - https://phabricator.wikimedia.org/T179595#3902376 (10fgiunchedi) 05Open>03Invalid This work has been done as part of {T175689} [11:41:47] RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.265 second response time [11:41:56] RECOVERY - HHVM rendering on mw1346 is OK: HTTP OK: HTTP/1.1 200 OK - 74220 bytes in 7.394 second response time [11:41:58] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3902379 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1347.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1347.eqiad.wmnet'] ``` [11:42:16] RECOVERY - puppet last run on mw1346 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:43:26] RECOVERY - Check whether ferm is active by checking the default input chain on mw1346 is OK: OK ferm input default policy is set [11:46:46] RECOVERY - configured eth on mw1346 is OK: OK - interfaces up [11:48:39] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3902386 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` mw1347.eqiad.wmnet ``` The log can be fo... 
[11:48:42] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3902387 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1347.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1347.eqiad.wmnet'] ``` [11:51:03] !log rebooting mc2* hosts for kernel security update [11:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:06] RECOVERY - IPMI Sensor Status on mw1346 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [11:55:45] (03PS3) 10ArielGlenn: permit use of 7zip compressed files for prefetch [dumps] - 10https://gerrit.wikimedia.org/r/399753 (https://phabricator.wikimedia.org/T179267) [11:58:28] (03CR) 10ArielGlenn: [C: 032] permit use of 7zip compressed files for prefetch [dumps] - 10https://gerrit.wikimedia.org/r/399753 (https://phabricator.wikimedia.org/T179267) (owner: 10ArielGlenn) [11:59:36] !log ariel@tin Started deploy [dumps/dumps@c165ca0]: enable 7z prefetch files for page content dumps [11:59:41] !log ariel@tin Finished deploy [dumps/dumps@c165ca0]: enable 7z prefetch files for page content dumps (duration: 00m 04s) [11:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:38] (03PS1) 10ArielGlenn: enable use of 7z files for page content dump prefetch [puppet] - 10https://gerrit.wikimedia.org/r/404437 (https://phabricator.wikimedia.org/T179267) [12:10:56] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1346 is OK: OK: synced at Tue 2018-01-16 12:10:45 UTC. 
[12:10:58] (03CR) 10ArielGlenn: [C: 032] enable use of 7z files for page content dump prefetch [puppet] - 10https://gerrit.wikimedia.org/r/404437 (https://phabricator.wikimedia.org/T179267) (owner: 10ArielGlenn) [12:14:16] RECOVERY - mediawiki-installation DSH group on mw1348 is OK: OK [12:31:44] PROBLEM - HHVM rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:33:34] (03PS1) 10Volans: wmf-auto-reimage: fix host validation logic [puppet] - 10https://gerrit.wikimedia.org/r/404439 (https://phabricator.wikimedia.org/T182702) [12:36:42] (03CR) 10Volans: "To give more context to the reviewers:" [puppet] - 10https://gerrit.wikimedia.org/r/404439 (https://phabricator.wikimedia.org/T182702) (owner: 10Volans) [12:36:53] PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:39:44] PROBLEM - nutcracker process on mw1347 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:39:44] RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [12:40:13] PROBLEM - Nginx local proxy to apache on mw1347 is CRITICAL: connect to address 10.64.32.59 and port 443: Connection refused [12:40:44] PROBLEM - HHVM rendering on mw1347 is CRITICAL: connect to address 10.64.32.59 and port 80: Connection refused [12:40:44] RECOVERY - nutcracker process on mw1347 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [12:42:53] PROBLEM - Apache HTTP on mw1347 is CRITICAL: connect to address 10.64.32.59 and port 80: Connection refused [12:47:53] RECOVERY - HHVM rendering on mw1347 is OK: HTTP OK: HTTP/1.1 200 OK - 74916 bytes in 0.281 second response time [12:47:53] RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.050 second response time [12:49:13] RECOVERY - Nginx local proxy to apache on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.068 second response time [12:56:38] 
!log reboot kafka100[23] for kernel upgrades [12:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:31] 10Operations, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3499956 (10MoritzMuehlenhoff) I rebooted this spare host for completeless wrt Meltdown kernel update and while it's now running the fixed kernel, sshd came up running the /etc/ssh/sshd_config.phabricator instead of the... [13:17:02] RECOVERY - Restbase root url on restbase-dev1004 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.012 second response time [13:18:11] RECOVERY - Restbase root url on restbase-dev1005 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.012 second response time [13:19:02] RECOVERY - Restbase root url on restbase-dev1006 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.031 second response time [13:19:49] 10Operations, 10Kubernetes: Operations 2017-18 Q2 Program 6 umbrella task - https://phabricator.wikimedia.org/T178325#3902489 (10mark) 05Open>03Resolved a:03mark [13:20:44] !log oblivian@neodymium conftool action : set/pooled=yes; selector: cluster=api_appserver,name=mw134[3-7]\.eqiad\.wmnet [13:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:21] (03PS1) 10Faidon Liambotis: Rename role pmacct to netinsights [puppet] - 10https://gerrit.wikimedia.org/r/404441 [13:27:16] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3641820 (10Joe) [13:28:50] _joe_: don't kill me [13:28:57] !log rebooting graphite2001 for kernel security update [13:29:00] (03CR) 10Faidon Liambotis: [C: 032] Rename role pmacct to netinsights [puppet] - 10https://gerrit.wikimedia.org/r/404441 (owner: 10Faidon Liambotis) [13:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:16] <_joe_> paravoid: why should I? [13:29:27] for doing include ::role::... 
from a role [13:29:43] (see above) [13:30:17] <_joe_> oh well, it's actually allowed as one role could want to be a collection of roles [13:30:25] <_joe_> but yeah, that should be a profile indeed [13:31:06] <_joe_> (others use class inheritance for roles, but I think inheritance in puppet is harmful, so...) [13:31:23] !log oblivian@neodymium conftool action : set/weight=25; selector: cluster=api_appserver,name=mw134[3-8]\.eqiad\.wmnet [13:31:25] elukey: around? [13:31:31] RECOVERY - mediawiki-installation DSH group on mw1345 is OK: OK [13:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:44] yeah that should be a profile, and yes I agree that inheritance in puppet is a mess [13:32:01] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3902514 (10Joe) 05Open>03Resolved [13:32:02] elukey: trying to understand 95f7dcb85573711495983bd1ce0069f15ebc0216 [13:32:03] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3902515 (10Joe) [13:33:02] paravoid: horrible code? [13:33:07] no [13:33:11] RECOVERY - mediawiki-installation DSH group on mw1344 is OK: OK [13:33:13] well, not that I know of! :) [13:33:33] I'm just trying to figure out why do we need to specify the cluster as kafka-jumbo in role pmacct (now netinsights) [13:33:45] instead of being picked up automatically to some sane default maybe? [13:33:52] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:34:09] and I also dunno if kafka-jumbo is the right one! 
:) [13:34:52] RECOVERY - mediawiki-installation DSH group on mw1343 is OK: OK [13:35:08] (03PS1) 10Faidon Liambotis: Rename pmacct to netinsights also in hiera [puppet] - 10https://gerrit.wikimedia.org/r/404442 [13:35:29] so kafka jumbo is the right one since analytics is going to be deprecated (hopefully this quarter or the next), for the default we can definitely use something different! [13:35:34] (03CR) 10Faidon Liambotis: [C: 032] Rename pmacct to netinsights also in hiera [puppet] - 10https://gerrit.wikimedia.org/r/404442 (owner: 10Faidon Liambotis) [13:36:21] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen [13:36:43] jesus what am I doing [13:37:23] (03CR) 10Rush: [C: 031] "Thanks filippo for looking at this" [puppet] - 10https://gerrit.wikimedia.org/r/404324 (https://phabricator.wikimedia.org/T184469) (owner: 10Filippo Giunchedi) [13:37:37] elukey: do we need to be overriding the kafka cluster in pmacct? also, does kafka-jumbo have the webrequest stream? [13:37:59] (03CR) 10Rush: [C: 031] mariadb: Set as spares labsdb1001 and labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/404323 (https://phabricator.wikimedia.org/T184832) (owner: 10Jcrespo) [13:38:21] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen [13:39:07] (03PS1) 10Faidon Liambotis: Drop netinsights' override for Kafka cluster etc. 
[puppet] - 10https://gerrit.wikimedia.org/r/404443 [13:39:25] hm no, that won't do it either [13:39:32] paravoid: we can have a default, jumbo will hopefully stay for a long time and will be used for these use cases + analytics.. Webrequest data still not there, we need to finish the TLS security review for varnishkafka/kafka before sending traffic (since for the moment we'd like to use TLS only from cp to kafka jumbo, without configuring ipsec) [13:40:06] https://phabricator.wikimedia.org/T182993 [13:40:09] ok, but I want to run kafkatee on the webrequest data now [13:40:31] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3902533 (10chasemp) [13:40:35] 10Operations, 10Cloud-Services, 10cloud-services-team: labvirt1021-1022 spam the dhcp server with requests - https://phabricator.wikimedia.org/T184909#3902535 (10chasemp) [13:40:49] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3867914 (10chasemp) Note from T184909 that these were spamming teh DHCP server ```I've seen this on install1002 via syslog, both servers seem to send multiple DHCPDISCOVER requests ev... 
[13:41:32] paravoid: if so you need to point kafkatee to the analytics cluster and then switch to jumbo when we'll do the migration [13:41:49] there are two different things running on that box [13:41:57] (or should run on that box) [13:42:00] kafkatee, and pmacct [13:42:27] the former consumes (and may produce), the latter just produces [13:42:55] so I guess the former can do that from analytics, the latter to jumbo [13:43:08] yep, my understanding is that pmacct produces to jumbo netflow data, that we'll then grab from Spark and push to Druid [13:43:24] ok [13:43:27] kafkatee needs to consume from analytics [13:44:21] (since jumbo is kafka 1.0 we have API negotiation working so no need for any librdkafka overrides) [13:44:56] !log rebooting graphite2002 for kernel security update [13:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:23] it feels odd to specify librdkafka_config and cluster etc. in pmacct's hiera [13:45:33] these are really details that shouldn't concern pmacct, right? [13:45:55] (03PS2) 10Faidon Liambotis: Use profile::pmacct, not profile::netinsights [puppet] - 10https://gerrit.wikimedia.org/r/404443 [13:45:58] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3902563 (10chasemp) [13:46:01] 10Operations, 10Cloud-VPS, 10cloud-services-team: Reboot non-labvirt cloud provider hardware for meltdown - https://phabricator.wikimedia.org/T184730#3902565 (10chasemp) [13:46:35] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=druid1004*.wmnet [13:46:45] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900502 (10chasemp) [13:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:48] also, looks like we don't have kafkatee for stretch, right? 
[13:46:48] 10Operations, 10Cloud-VPS, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3902566 (10chasemp) [13:46:55] I guess I should just build it? [13:46:58] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=druid1004.*.wmnet [13:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:13] (03CR) 10Faidon Liambotis: [C: 032] Use profile::pmacct, not profile::netinsights [puppet] - 10https://gerrit.wikimedia.org/r/404443 (owner: 10Faidon Liambotis) [13:47:28] paravoid: yep! [13:47:34] (03PS1) 10Jgreen: rename host frdb1003 to frav1001, and change main IP back to 10.64.40.71 [dns] - 10https://gerrit.wikimedia.org/r/404444 [13:47:47] should I build it against stretch's librdkafka (0.9) or stretch-backports (0.11)? [13:49:03] (03CR) 10Jgreen: [C: 032] rename host frdb1003 to frav1001, and change main IP back to 10.64.40.71 [dns] - 10https://gerrit.wikimedia.org/r/404444 (owner: 10Jgreen) [13:49:44] paravoid: I am fine with both but if we use 0.11 we need to be really careful to set api versions accordingly (https://github.com/edenhill/librdkafka/wiki/Broker-version-compatibility) when kafkatee consumes from any kafka 0.9 cluster (main eqiad/codfw and analytics) [13:52:07] !log reboot druid1004 for kernel upgrades [13:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:31] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen [13:53:31] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] 
https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen [13:53:48] elukey: I was asking about build-time, not runtime :) but OK, I'll figure it out [13:54:19] !log rebooting graphite1002 for kernel security update [13:54:26] paravoid: sorry then I have no idea :( [13:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:34] elukey: the last kafkatee commit is from you, you bumped the upstream version from 0.1.5 to 0.1.6 -- do you have a specific release process or is it semi-random? [13:56:11] PROBLEM - cassandra-b service on restbase1018 is CRITICAL: Return code of 255 is out of bounds [13:56:11] PROBLEM - dhclient process on restbase1018 is CRITICAL: Return code of 255 is out of bounds [13:56:12] PROBLEM - Disk space on restbase1018 is CRITICAL: Return code of 255 is out of bounds [13:56:21] PROBLEM - Check the NTP synchronisation status of timesyncd on restbase1018 is CRITICAL: Return code of 255 is out of bounds [13:56:21] PROBLEM - configured eth on restbase1018 is CRITICAL: Return code of 255 is out of bounds [13:56:22] PROBLEM - Check systemd state on restbase1018 is CRITICAL: Return code of 255 is out of bounds [13:56:22] PROBLEM - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:56:31] PROBLEM - cassandra-b CQL 10.64.48.99:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.99 and port 9042: Connection refused [13:56:31] PROBLEM - cassandra-a service on restbase1018 is CRITICAL: Return code of 255 is out of bounds [13:56:32] PROBLEM - MD RAID on restbase1018 is CRITICAL: Return code of 255 is out of bounds [13:56:32] PROBLEM - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:56:36] !log elukey@puppetmaster1001 conftool 
action : set/pooled=yes; selector: name=druid1004.*.wmnet [13:56:38] gah, sorry that's me, downtime expired [13:56:41] PROBLEM - cassandra-c CQL 10.64.48.100:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.100 and port 9042: Connection refused [13:56:41] PROBLEM - DPKG on restbase1018 is CRITICAL: Return code of 255 is out of bounds [13:56:41] PROBLEM - Check size of conntrack table on restbase1018 is CRITICAL: Return code of 255 is out of bounds [13:56:41] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: Return code of 255 is out of bounds [13:56:42] PROBLEM - cassandra-c service on restbase1018 is CRITICAL: Return code of 255 is out of bounds [13:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:51] PROBLEM - Check whether ferm is active by checking the default input chain on restbase1018 is CRITICAL: Return code of 255 is out of bounds [13:57:46] (03PS4) 10Zfilipin: Revert "Restrict sending mails to new users" config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403571 (https://phabricator.wikimedia.org/T184470) (owner: 10Dmaza) [14:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180116T1400). [14:00:04] dmaza: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:11] I can SWAT today [14:00:22] dmaza: around for SWAT? [14:00:26] hello [14:00:41] yup.. I'm here [14:00:42] dmaza: do you want to deploy yourself, or should I do it? [14:01:04] you should do it. 
I don't believe I can do it myself [14:01:13] !log rebooting labtest* hosts for kernel security update [14:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:28] dmaza: ok, I'll ping you in a few minutes, when the patch is at mwdebug1002 [14:01:32] do you know how to test there? [14:01:38] I do [14:01:40] thanks [14:01:52] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403571 (https://phabricator.wikimedia.org/T184470) (owner: 10Dmaza) [14:01:57] then we are all set! :) [14:03:19] (03Merged) 10jenkins-bot: Revert "Restrict sending mails to new users" config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403571 (https://phabricator.wikimedia.org/T184470) (owner: 10Dmaza) [14:03:35] (03CR) 10jenkins-bot: Revert "Restrict sending mails to new users" config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403571 (https://phabricator.wikimedia.org/T184470) (owner: 10Dmaza) [14:04:30] (03PS1) 10Rush: tools: add arturo to existing shinken [puppet] - 10https://gerrit.wikimedia.org/r/404446 [14:05:08] (03CR) 10jerkins-bot: [V: 04-1] tools: add arturo to existing shinken [puppet] - 10https://gerrit.wikimedia.org/r/404446 (owner: 10Rush) [14:06:17] dmaza: the patch is at mwdebug1002, please test and let me know if I can deploy [14:06:29] testing [14:07:03] (03PS2) 10Rush: tools: add arturo to existing shinken [puppet] - 10https://gerrit.wikimedia.org/r/404446 (https://phabricator.wikimedia.org/T178807) [14:08:03] (03CR) 10Rush: [C: 032] tools: add arturo to existing shinken [puppet] - 10https://gerrit.wikimedia.org/r/404446 (https://phabricator.wikimedia.org/T178807) (owner: 10Rush) [14:08:42] 10Operations, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3902655 (10Dzahn) No, no point i debugging indeed. Instead it would be really nice if it could be shutdown after running such a long time doing nothing. 
[14:09:06] !log bootstrap cassandra-b on restbase1018 [14:09:12] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3902656 (10Dzahn) [14:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:01] RECOVERY - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is OK: SSL OK - Certificate restbase1018-b valid until 2018-08-17 16:11:35 +0000 (expires in 213 days) [14:10:18] 10Operations, 10Developer-Relations, 10Discourse: Enable GitHub login in discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T184986#3902658 (10revi) Organization account (gh/wikimedia) can make OAuth Apps. Owner access required, though. (Settings>(Developer settings)>OAuth apps> register an... [14:10:22] zeljkof, looks good [14:10:32] dmaza: ok, deploying... [14:11:55] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:403571|Revert "Restrict sending mails to new users" config change (T184470)]] (duration: 01m 13s) [14:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:09] T184470: Rollback and clean up code from T178842 - https://phabricator.wikimedia.org/T184470 [14:12:16] dmaza: deployed! please check and thanks for deploying with #releng! ;) [14:12:18] !log reboot druid100[56] for kernel upgrades [14:12:28] zeljkof, thank you [14:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:09] !log EU SWAT finished [14:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:21] zeljkof, everything looks good.. Thanks again [14:13:36] great! [14:16:01] RECOVERY - mediawiki-installation DSH group on mw1346 is OK: OK [14:18:36] !log powercycling labtestservices2001, stuck in reboot [14:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:39] elukey: ? [14:21:00] elukey: also, do we have any trusty/kafkatee users in our infrastructure? 
[14:23:56] paravoid: sorry, I didn't see the previous ping [14:24:39] about trusty, I'd ask Jeff, he is running the only instance that I have no idea where/how it runs [14:25:17] !log powercycling labtestservices2003, stuck in reboot [14:25:24] (03CR) 10jenkins-bot: ClusterShell backend: fix execute() return code [software/cumin] - 10https://gerrit.wikimedia.org/r/399829 (owner: 10Volans) [14:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:54] paravoid: about the release process, I don't remember exactly what I did last time but iirc I just built the package on copper and that's it [14:27:04] (so I haven't followed any specific procedure) [14:28:50] (03CR) 10jenkins-bot: ClusterShell backend: fix execute() return code [software/cumin] - 10https://gerrit.wikimedia.org/r/399829 (owner: 10Volans) [14:31:01] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/9743/" [puppet] - 10https://gerrit.wikimedia.org/r/404324 (https://phabricator.wikimedia.org/T184469) (owner: 10Filippo Giunchedi) [14:31:07] (03PS5) 10Filippo Giunchedi: prometheus: allow override of fs-related node-exporter options [puppet] - 10https://gerrit.wikimedia.org/r/404324 (https://phabricator.wikimedia.org/T184469) [14:31:54] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: allow override of fs-related node-exporter options [puppet] - 10https://gerrit.wikimedia.org/r/404324 (https://phabricator.wikimedia.org/T184469) (owner: 10Filippo Giunchedi) [14:32:22] (03PS13) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) [14:32:39] hashar: any clue why jenkins published results on a cumin patch that was merged 2w ago?
[14:32:54] see few lines above [14:32:58] (03CR) 10jerkins-bot: [V: 04-1] apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [14:33:42] PROBLEM - Check systemd state on druid1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:34:23] sorry late downtime for --^ [14:35:35] (03CR) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [14:35:41] RECOVERY - Check systemd state on druid1006 is OK: OK - running: The system is fully operational [14:35:48] !log rebooting graphite1003 for kernel security update [14:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:57] (03PS14) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) [14:37:53] (03CR) 10jerkins-bot: [V: 04-1] apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [14:52:43] !log rebooting graphite1001 for kernel security update [14:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:32] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404452 [14:56:06] (03PS1) 10Giuseppe Lavagetto: site.pp: reorganize appservers in eqiad by function/row [puppet] - 10https://gerrit.wikimedia.org/r/404453 [14:56:29] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404452 (owner: 10Marostegui) [14:58:01] (03Merged) 
10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404452 (owner: 10Marostegui) [14:58:11] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404452 (owner: 10Marostegui) [14:58:52] (03PS1) 10Filippo Giunchedi: install_server: switch remaining cassandra hosts to jbod [puppet] - 10https://gerrit.wikimedia.org/r/404455 (https://phabricator.wikimedia.org/T184100) [14:59:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1105:3311 - T162807 (duration: 01m 09s) [14:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:42] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [14:59:53] elukey: can you review/merge the changes I just pushed to analytics/kafkatee? [15:00:12] or ottomata[m] [15:00:36] paravoid: I can later on if it is not a problem [15:00:47] (otherwise if it is urgent I'll do it now) [15:00:50] it's high priority, but can wait for later today, sure [15:01:05] I built the package and will include it in stretch-wikimedia in the meantime [15:01:07] (03CR) 10Filippo Giunchedi: [C: 032] install_server: switch remaining cassandra hosts to jbod [puppet] - 10https://gerrit.wikimedia.org/r/404455 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [15:03:54] (03PS1) 10Marostegui: db-eqiad.php: Depool db1101:3317,3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404466 (https://phabricator.wikimedia.org/T174569) [15:03:56] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:06:28] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1101:3317,3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404466 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [15:08:01] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1101:3317,3318 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/404466 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [15:08:13] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1101:3317,3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404466 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [15:10:01] paravoid: done :) [15:10:08] thanks for the refactoring! [15:10:09] haha [15:10:10] thanks :) [15:10:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1101:3317 db1101:3318 for schema change, mariadb upgrade and kernel upgrade - T162807 (duration: 01m 12s) [15:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:30] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [15:11:35] !log Upgrade mariadb and kernel on db1101 [15:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:43] 10Operations, 10Cloud-VPS, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3902826 (10chasemp) [15:17:17] (03CR) 10Faidon Liambotis: [C: 031] Migration to Python 3 [software/cumin] - 10https://gerrit.wikimedia.org/r/402059 (owner: 10Volans) [15:18:06] (03CR) 10Faidon Liambotis: [C: 031] PuppetDB backend: add support for API v4 [software/cumin] - 10https://gerrit.wikimedia.org/r/399821 (https://phabricator.wikimedia.org/T182575) (owner: 10Volans) [15:26:31] 10Operations, 10Android-app-feature-Compilations, 10Reading-Infrastructure-Team-Backlog, 10Traffic, 10Wikipedia-Android-App-Backlog: Determine URL paths for Zim files - https://phabricator.wikimedia.org/T172148#3902896 (10Fjalapeno) [15:30:01] (03PS1) 10Marostegui: db-eqiad.php: Repool db1101:3317 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404471 [15:30:17] 10Operations, 10HHVM: Decommission mw1201-mw1220 - 
https://phabricator.wikimedia.org/T185004#3902917 (10Joe) p:05Triage>03Normal [15:30:37] 10Operations, 10HHVM, 10User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#3902930 (10Joe) a:05ayounsi>03Joe [15:30:39] !log Deploy schema change on db1101:3318 - T174569 [15:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:52] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [15:31:09] !log rebooting acamar for kernel security update [15:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:46] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184836#3902934 (10Papaul) a:05Papaul>03Marostegui @Marostegui Disk replacement complete. [15:32:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1101:3317 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404471 (owner: 10Marostegui) [15:33:09] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184836#3902940 (10Marostegui) Thanks! ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete) physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, Rebuilding) ``` [15:33:52] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1101:3317 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404471 (owner: 10Marostegui) [15:34:13] 10Operations, 10Developer-Relations, 10Discourse: Discourse migration from wmflabs to production - https://phabricator.wikimedia.org/T184461#3883546 (10Bkybala9) I am a seasoned forum owner and user. Have used all php based forum software infact was doing my GSOC last year for phpBB. I must admit that I am... 
[15:34:48] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3902958 (10Volans) [15:34:50] 10Operations, 10Puppet, 10User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548#3902956 (10Volans) 05Open>03Resolved [15:36:36] (03PS1) 10Jcrespo: compare.py: Implement parallel queries between servers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/404472 [15:37:02] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1101:3317 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404471 (owner: 10Marostegui) [15:38:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1101:3317 after kernel upgrade (duration: 01m 12s) [15:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:13] (03CR) 10Faidon Liambotis: [C: 04-1] "Cool! That's indeed better. See inline for a bunch (mostly more Python-style) comments :)" (0314 comments) [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [15:41:19] 10Operations, 10monitoring, 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027#3902972 (10fgiunchedi) In a prometheus world `cadvisor` seems to be doing what we want (i.e. export cgroup statistics, including systemd cgroups). After enabli... [15:41:33] !log rebooting achernar for kernel security update [15:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:29] 10Operations, 10Developer-Relations, 10Discourse: Discourse migration from wmflabs to production - https://phabricator.wikimedia.org/T184461#3902978 (10Qgil) Hi @Bkybala9, thank you for your offer to help, but we have still a long way before we can start discussing this migration. See all the list of blockin... 
[15:44:45] !log reboot druid1003 for kernel upgrades [15:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:07] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404478 [15:51:14] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler02/9745/ this is a noop to all effects." [puppet] - 10https://gerrit.wikimedia.org/r/404453 (owner: 10Giuseppe Lavagetto) [15:52:41] (03CR) 10Herron: [C: 032] puppet.conf: replace configtimeout [puppet] - 10https://gerrit.wikimedia.org/r/398484 (https://phabricator.wikimedia.org/T182585) (owner: 10Andrew Bogott) [15:52:47] (03PS2) 10Herron: puppet.conf: replace configtimeout [puppet] - 10https://gerrit.wikimedia.org/r/398484 (https://phabricator.wikimedia.org/T182585) (owner: 10Andrew Bogott) [15:52:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404478 (owner: 10Marostegui) [15:53:08] 10Operations, 10Developer-Relations, 10Discourse, 10Release-Engineering-Team: Enable GitHub login in discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T184986#3903020 (10Qgil) Thanks! It seems that #release-engineering-team has access to Wikimedia's GitHub account. Adding them to the loop. [15:53:47] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T184787#3903022 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your request is... 
[15:54:39] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404478 (owner: 10Marostegui) [15:56:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1101:3317 after kernel upgrade (duration: 01m 12s) [15:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:57] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404478 (owner: 10Marostegui) [15:57:18] PROBLEM - Check whether ferm is active by checking the default input chain on restbase1018 is CRITICAL: Return code of 255 is out of bounds [15:57:18] PROBLEM - dhclient process on restbase1018 is CRITICAL: Return code of 255 is out of bounds [15:57:31] !log rebooting labvirt1001 [15:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:52] PROBLEM - cassandra-b service on restbase1018 is CRITICAL: Return code of 255 is out of bounds [15:57:53] PROBLEM - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:57:53] PROBLEM - Disk space on restbase1018 is CRITICAL: Return code of 255 is out of bounds [15:57:54] PROBLEM - Check systemd state on restbase1018 is CRITICAL: Return code of 255 is out of bounds [15:58:00] aannd that's me again [15:58:02] PROBLEM - MD RAID on restbase1018 is CRITICAL: Return code of 255 is out of bounds [15:58:03] PROBLEM - cassandra-a service on restbase1018 is CRITICAL: Return code of 255 is out of bounds [15:58:03] PROBLEM - DPKG on restbase1018 is CRITICAL: Return code of 255 is out of bounds [15:58:12] PROBLEM - Check size of conntrack table on restbase1018 is CRITICAL: Return code of 255 is out of bounds [15:58:12] PROBLEM - configured eth on restbase1018 is CRITICAL: Return code of 255 is out of bounds [15:58:13] PROBLEM - cassandra-c service on restbase1018 is 
CRITICAL: Return code of 255 is out of bounds [15:58:13] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: Return code of 255 is out of bounds [15:59:13] RECOVERY - Long running screen/tmux on graphite1001 is OK: OK: No SCREEN or tmux processes detected. [16:00:33] PROBLEM - puppet last run on wtp1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:12] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:12] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:13] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:13] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:23] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:42] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:02:02] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:02:10] ah! nitrogen failure? [16:02:23] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:02:42] PROBLEM - puppet last run on dysprosium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:02:42] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:02:44] again? 
[16:02:56] yep [16:03:06] (03PS1) 10Herron: Revert "puppet.conf: replace configtimeout" [puppet] - 10https://gerrit.wikimedia.org/r/404479 [16:03:08] elukey: can we force tegmen to have a different puppet crontab time than einsteinium? [16:03:12] !log rebooting labweb* hosts for kernel security update [16:03:14] s/force/easily force/ [16:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:43] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:03:52] PROBLEM - puppet last run on restbase1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:03:52] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:03:52] PROBLEM - puppet last run on labvirt1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:03:52] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:03:52] PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:04:13] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:04:52] !log add arturo to acl*operations-team [16:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:21] volans: we could, but is there any indication that this could be the issue?
(asking because of ignorance) [16:05:34] (03CR) 10Herron: [C: 032] Revert "puppet.conf: replace configtimeout" [puppet] - 10https://gerrit.wikimedia.org/r/404479 (owner: 10Herron) [16:05:52] RECOVERY - nutcracker process on labweb1001 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [16:06:00] (03PS1) 10Herron: Revert "Revert "puppet.conf: replace configtimeout"" [puppet] - 10https://gerrit.wikimedia.org/r/404480 [16:06:02] RECOVERY - nutcracker port on labweb1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [16:06:08] elukey: no strong indication, but it could, the time seems to align at first sight and in general it is wrong to have 100% of hosts of a cluster running puppet at the same time [16:06:25] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404482 [16:06:44] in particular for Icinga hosts that are pretty heavy on the puppet run side ;) [16:06:57] volans: ack, seems good [16:07:02] RECOVERY - cassandra-b CQL 10.64.48.99:9042 on restbase1018 is OK: TCP OK - 0.000 second response time on 10.64.48.99 port 9042 [16:07:33] RECOVERY - Apache HTTP on labweb1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 620 bytes in 0.091 second response time [16:07:42] RECOVERY - HHVM rendering on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 75175 bytes in 0.212 second response time [16:08:14] !log bootstrap cassandra-c on restbase1018 [16:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:43] RECOVERY - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is OK: SSL OK - Certificate restbase1018-c valid until 2018-08-17 16:11:36 +0000 (expires in 213 days) [16:09:46] !log rebooting praseodymium for kernel security update [16:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:00] <_joe_> !log depooling mw1201-1208 from the API cluster, T185004 [16:10:09] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:11] T185004: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004 [16:10:14] (03PS1) 10Andrew Bogott: nova compute: whitelist meltdown-safe kernel version [puppet] - 10https://gerrit.wikimedia.org/r/404484 [16:10:52] (03CR) 10Andrew Bogott: [C: 032] nova compute: whitelist meltdown-safe kernel version [puppet] - 10https://gerrit.wikimedia.org/r/404484 (owner: 10Andrew Bogott) [16:11:04] !log oblivian@neodymium conftool action : set/pooled=no; selector: cluster=api_appserver,name=mw120[1-8]\.eqiad\.wmnet [16:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404482 (owner: 10Marostegui) [16:14:32] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:15:20] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404482 (owner: 10Marostegui) [16:16:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1101:3317 (duration: 01m 12s) [16:16:50] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404482 (owner: 10Marostegui) [16:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:02] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:20:42] RECOVERY - cassandra-b service on restbase1018 is OK: OK - cassandra-b is active [16:20:59] RECOVERY - Check systemd state on restbase1018 is OK: OK - running: The system is fully operational [16:21:09] RECOVERY - cassandra-a service on restbase1018 is OK: OK - cassandra-a is active [16:21:09] RECOVERY - MD RAID on restbase1018 is OK: OK: Active: 12, Working: 12, Failed: 
0, Spare: 0 [16:21:10] RECOVERY - DPKG on restbase1018 is OK: All packages OK [16:21:10] RECOVERY - Check size of conntrack table on restbase1018 is OK: OK: nf_conntrack is 3 % full [16:21:10] RECOVERY - configured eth on restbase1018 is OK: OK - interfaces up [16:21:13] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404486 [16:21:19] RECOVERY - cassandra-c service on restbase1018 is OK: OK - cassandra-c is active [16:21:19] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [16:21:22] (03CR) 10jerkins-bot: [V: 04-1] Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404486 (owner: 10Marostegui) [16:21:29] RECOVERY - Check whether ferm is active by checking the default input chain on restbase1018 is OK: OK ferm input default policy is set [16:21:29] RECOVERY - dhclient process on restbase1018 is OK: PROCS OK: 0 processes with command name dhclient [16:22:40] RECOVERY - Disk space on restbase1018 is OK: DISK OK [16:22:55] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404487 (https://phabricator.wikimedia.org/T174569) [16:23:10] (03Abandoned) 10Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404486 (owner: 10Marostegui) [16:25:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404487 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [16:26:23] RECOVERY - Check the NTP synchronisation status of timesyncd on restbase1018 is OK: OK: synced at Tue 2018-01-16 16:26:16 UTC. 
[16:27:56] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404487 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [16:28:06] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404487 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [16:28:43] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:28:43] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:28:53] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:28:53] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:28:53] RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:29:14] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1092 - T174569 (duration: 01m 08s) [16:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:44] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [16:30:33] RECOVERY - puppet last run on wtp1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:30:37] (03PS2) 10Milimetric: Add two schemas to the whitelist [puppet] - 10https://gerrit.wikimedia.org/r/402907 [16:31:13] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:31:13] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
[16:31:14] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:31:14] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:31:24] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:31:43] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:31:54] RECOVERY - puppet last run on labtestvirt2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:32:02] ottomata[m]: ^ i swapped you into the topic [16:32:06] since you are on duty this week [16:32:24] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:32:33] (03CR) 10Volans: "I just did a quick pass on the general approach, see my comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [16:32:37] (03PS3) 10Milimetric: Add two schemas to the whitelist [puppet] - 10https://gerrit.wikimedia.org/r/402907 [16:32:43] RECOVERY - puppet last run on dysprosium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:43] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:33:36] (03CR) 10Mforns: [C: 031] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/402907 (owner: 10Milimetric) [16:33:53] RECOVERY - puppet last run on labvirt1015 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:33:53] RECOVERY - puppet last run on restbase1011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:34:43] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404492 [16:35:05] (03PS4) 10Elukey: profile::mariadb::misc::eventlogging: add two schemas to the whitelist [puppet] - 10https://gerrit.wikimedia.org/r/402907 (owner: 10Milimetric) [16:35:13] (03PS5) 10Elukey: profile::mariadb::misc::eventlogging: add two schemas to the whitelist [puppet] - 10https://gerrit.wikimedia.org/r/402907 (owner: 10Milimetric) [16:37:02] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404492 (owner: 10Marostegui) [16:38:35] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404492 (owner: 10Marostegui) [16:38:49] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404492 (owner: 10Marostegui) [16:40:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1092 (duration: 01m 12s) [16:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:20] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404494 [16:48:43] PROBLEM - puppet last run on mw2140 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:48:53] PROBLEM - Check whether ferm is active by checking the default input chain on mw2140 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:49:03] PROBLEM - Check systemd state on mw2140 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:50:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404494 (owner: 10Marostegui) [16:52:21] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404494 (owner: 10Marostegui) [16:52:34] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404494 (owner: 10Marostegui) [16:53:34] PROBLEM - mediawiki-installation DSH group on mw2140 is CRITICAL: Host mw2140 is not in mediawiki-installation dsh group [16:53:43] RECOVERY - puppet last run on mw2140 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:53:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1092 (duration: 01m 12s) [16:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] godog, moritzm, and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180116T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. 
[17:00:09] (03PS2) 10Giuseppe Lavagetto: site.pp: reorganize appservers in eqiad by function/row [puppet] - 10https://gerrit.wikimedia.org/r/404453 [17:00:10] (03PS1) 10Giuseppe Lavagetto: site.pp: reorganize MediaWiki appservers in codfw for role/row [puppet] - 10https://gerrit.wikimedia.org/r/404498 [17:00:12] (03PS1) 10Giuseppe Lavagetto: site.pp: decommission mw1201-1208 [puppet] - 10https://gerrit.wikimedia.org/r/404499 (https://phabricator.wikimedia.org/T185004) [17:00:14] (03PS1) 10Marostegui: db-eqiad.php: Restore db1092 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404501 [17:00:17] (03PS1) 10Giuseppe Lavagetto: site.pp: decommission mw1209-1220 [puppet] - 10https://gerrit.wikimedia.org/r/404500 (https://phabricator.wikimedia.org/T185004) [17:03:01] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1092 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404501 (owner: 10Marostegui) [17:03:32] (03CR) 10Elukey: [C: 032] profile::mariadb::misc::eventlogging: add two schemas to the whitelist [puppet] - 10https://gerrit.wikimedia.org/r/402907 (owner: 10Milimetric) [17:04:12] milimetric: merged! 
[17:04:33] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1092 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404501 (owner: 10Marostegui) [17:04:57] thx elukey [17:06:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1092 original weight (duration: 01m 12s) [17:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:48] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1092 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404501 (owner: 10Marostegui) [17:08:49] (03PS2) 10RobH: Add bawolff to additional groups [puppet] - 10https://gerrit.wikimedia.org/r/403430 (https://phabricator.wikimedia.org/T184582) [17:08:59] (03PS3) 10RobH: Add bawolff to additional groups [puppet] - 10https://gerrit.wikimedia.org/r/403430 (https://phabricator.wikimedia.org/T184582) [17:09:33] (03CR) 10Zoranzoki21: [C: 031] Add bawolff to additional groups [puppet] - 10https://gerrit.wikimedia.org/r/403430 (https://phabricator.wikimedia.org/T184582) (owner: 10RobH) [17:10:14] (03CR) 10RobH: [C: 032] Add bawolff to additional groups [puppet] - 10https://gerrit.wikimedia.org/r/403430 (https://phabricator.wikimedia.org/T184582) (owner: 10RobH) [17:11:24] !log upgrading and rebooting labvirt1002 [17:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:33] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 524 bytes in 3.970 second response time [17:15:43] PROBLEM - Host labvirt1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:16:23] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/redis - 356 bytes in 60.019 second response time [17:18:53] RECOVERY - Host labvirt1002 is UP: PING OK - Packet 
loss = 0%, RTA = 0.29 ms [17:24:06] thanks paravoid volans for the code review [17:24:33] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.570 second response time [17:24:50] arturo: yw :) feel free to ask any question anytime [17:24:57] indeed! same here :) [17:26:03] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.022 second response time [17:31:23] !log rebooting labvirt1004 [17:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:43] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 255 bytes in 3.300 second response time [17:35:23] RECOVERY - cassandra-c CQL 10.64.48.100:9042 on restbase1018 is OK: TCP OK - 0.000 second response time on 10.64.48.100 port 9042 [17:41:11] wikibugs a bit delayed? 
[17:41:16] don't see some updates to CRs [17:43:14] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.007 second response time [17:45:44] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.923 second response time [17:45:52] !log disabled puppet agents troubleshooting T184444 [17:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:03] T184444: Puppet hosts with their cert revoked can still run puppet - https://phabricator.wikimedia.org/T184444 [17:47:14] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.009 second response time [17:49:10] Krinkle: o/ - me and ottomata were wondering if you know anything about brrd running on eventlog1001 [17:49:16] (https://github.com/wikimedia/operations-software-brrd) [17:49:26] it is spamming a lot syslog and dmesg [17:49:45] it is run via upstart and not puppetized, so probably old [17:49:51] do you guys still use it? 
[17:50:08] !log rebooting labvirt1005 [17:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:00] !log re-enabled puppet agents [17:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:34] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 532 bytes in 3.034 second response time [17:53:24] PROBLEM - Host labvirt1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:58:03] RECOVERY - Host labvirt1005 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [17:58:34] PROBLEM - HHVM jobrunner on mw1311 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [17:59:06] !log rebooting labnet100[34] and labcontrol100[34] for kernel security update [17:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:34] RECOVERY - HHVM jobrunner on mw1311 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180116T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:14] No ORES today. [18:03:43] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.479 second response time [18:04:10] Reedy: i was in the middle of downloading "*.zip.024" :o [18:04:26] like back in the 90s, like RARs :p [18:04:59] then i realized.. well maybe you should check the last patch set you uploaded when they put it on dropbox.. 
it's probably not merged yet [18:05:07] and i see you did PS :)) [18:05:09] 2 [18:05:18] andrewbogott: I'm going to silence tools.checker for awhile [18:05:25] ok [18:05:32] but fyi we should validate it's green all around in icinga before an all clear [18:06:49] andrewbogott: direct link fyi https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=checker.tools.wmflabs.org [18:09:33] RECOVERY - HP RAID on db2036 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [18:12:44] PROBLEM - Host labvirt1006 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:53] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 1.000 second response time [18:15:53] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 578 bytes in 0.165 second response time [18:17:11] !log arlolra@tin Started deploy [parsoid/deploy@1026fd2]: Updating Parsoid to 231bfff [18:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:51] !log rebooting labmon1002 for kernel security update [18:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:03] RECOVERY - Host labvirt1006 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [18:25:38] !log removing ganeti VM puppetcompiler1001 [18:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:25] !log arlolra@tin Finished deploy [parsoid/deploy@1026fd2]: Updating Parsoid to 231bfff (duration: 13m 13s) [18:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:14] herron: puppetcompiler1001 is also listed in site.pp, BTW (not sure whether you're removing it for good or just replacing it with a reimage) [18:34:31] moritzm: thanks, it’s been removed from ganeti for good. 
working on patches to remove it from puppet, dhcp, dns, etc. now [19:14:16] arturo: something like this: https://etherpad.wikimedia.org/p/volans-tmp3 [19:14:50] you can't do that [19:15:12] you will see something like this [19:15:19] https://www.irccloud.com/pastebin/YZeFEC6J/ [19:15:47] why? apply() accepts only one parameter [19:16:06] but you are giving it 2, right? [19:16:06] no [19:16:29] I'm instantiating the filter class with a constructor to which I pass a parameter [19:16:33] the apply() is not touched [19:16:43] and follows the signature required by the API [19:17:08] I see 2: 'self' and 'pkg'. I tried only with 'pkg' and it doesn't work either [19:17:28] I mean, I tested before exactly your same code with no luck [19:17:36] I've just run it [19:17:56] self is the parameter required by any instance method in Python classes [19:18:08] it refers to the current object and is passed automatically by Python itself [19:18:31] ok I will check again tomorrow. I probably missed the right signature in the class declaration [19:20:40] arturo: ok, I've also checked in the source of python-apt that the Filter object doesn't have any __init__ defined [19:21:07] to be cleaner you should also add a call to the parent constructor, just in case they add one in the future [19:21:53] * volans updated the etherpad (assuming python3) [19:22:41] * volans off for dinner, we can continue tomorrow ;) [19:22:56] sure thanks volans :-) [19:23:01] yw [19:26:12] 10Operations, 10Education-Program-Dashboard, 10Programs-and-Events-Dashboard-Sprint 2, 10Spike: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#3903751 (10Ottomata) p:05Triage>03Normal [19:26:44] hey, what do you think. 
ganeti VM for "prod bots" and put wikibugs and icinga-wm on it [19:26:55] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184836#3903753 (10Marostegui) 05Open>03Resolved All good! Thanks a lot Papaul! ``` root@db2036:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 001438031205FF0) Port... [19:27:40] 10Operations, 10media-storage: Two cases of local-multiwrite storage backend failure - https://phabricator.wikimedia.org/T174269#3903774 (10Ottomata) p:05Triage>03Normal [19:27:55] mutante i thought icinga-wm is hosted on a prod machine? But wikibugs is not. [19:28:36] 10Operations, 10Graphite: unused grafana-dashboard indices on elasticsearch / logstash - https://phabricator.wikimedia.org/T174172#3903777 (10Ottomata) p:05Triage>03Low [19:29:42] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/402665 (https://phabricator.wikimedia.org/T184778) (owner: 10Paladox) [19:29:56] (03PS4) 1020after4: Scap canary: cache last good deploy time [puppet] - 10https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: 10Thcipriani) [19:29:58] 10Operations, 10Mail, 10Wikidata: Large number of "A page you created was linked on Wikidata" emails to one recipient in short period of time - https://phabricator.wikimedia.org/T177099#3903781 (10Ottomata) p:05Triage>03Low [19:30:20] !log restarted wikibugs (several attempts, eventually it worked) [19:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:44] (03CR) 1020after4: [C: 031] Scap canary: cache last good deploy time [puppet] - 10https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: 10Thcipriani) [19:30:48] paladox: eh, that's right! i should have said wikibugs and stashbot [19:31:02] Yep. 
[19:31:25] 10Operations, 10cloud-services-team (Kanban): puppet ca_server confusion - https://phabricator.wikimedia.org/T176437#3903786 (10Ottomata) p:05Triage>03Normal [19:31:53] 10Operations, 10ops-esams: Degraded RAID on bast3002 - https://phabricator.wikimedia.org/T183814#3903790 (10Ottomata) p:05Triage>03Normal [19:32:23] 10Operations, 10monitoring, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454#3903793 (10Ottomata) p:05Triage>03Normal [19:32:39] 10Operations, 10ops-eqiad, 10hardware-requests, 10netops: unrack/decom pfw1-eqiad and pfw2-eqiad - https://phabricator.wikimedia.org/T183390#3903795 (10Ottomata) p:05Triage>03Normal [19:32:45] 10Operations, 10ops-eqiad, 10Analytics: Decommission kafka1018 - https://phabricator.wikimedia.org/T182955#3903797 (10Ottomata) p:05Triage>03Normal [19:33:50] 10Operations, 10Commons, 10Multimedia, 10media-storage: Generate a list of files that are supposed to exist but 404s - https://phabricator.wikimedia.org/T182822#3903799 (10Ottomata) p:05Triage>03Normal [19:34:25] 10Operations, 10Mail, 10Toolforge, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3903800 (10Ottomata) p:05Triage>03Normal [19:34:37] 10Operations: Use firmware-enriched Debian installation images - https://phabricator.wikimedia.org/T182699#3903801 (10Ottomata) p:05Triage>03Normal [19:34:42] 10Operations, 10Developer-Relations, 10Discourse: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3903802 (10Framawiki) [19:34:54] 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331#3903803 (10Ottomata) p:05Triage>03Low [19:35:02] 10Operations, 10ORES, 10Scoring-platform-team: Tuning profile::ores::celery parameters should cause a Celery service restart - https://phabricator.wikimedia.org/T182203#3903804 (10Ottomata) 
p:05Triage>03Normal [19:35:12] 10Operations, 10ops-eqsin: rack/setup/install lvs500[123] - https://phabricator.wikimedia.org/T182171#3903805 (10Ottomata) p:05Triage>03Normal [19:35:37] 10Operations, 10media-storage: Requesting access to swift for Phabricator's git-lfs storage - https://phabricator.wikimedia.org/T182085#3903823 (10Ottomata) p:05Triage>03Normal [19:35:48] 10Operations, 10Operations-Software-Development: DNS repo: add CI checks for obvious configuration errors - https://phabricator.wikimedia.org/T182028#3903824 (10Ottomata) p:05Triage>03Normal [19:36:50] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Investigate and improve memory allocation rates of WDQS - https://phabricator.wikimedia.org/T181988#3903825 (10Ottomata) p:05Triage>03Normal [19:36:56] 10Operations, 10Puppet, 10User-Joe: Update puppet code to conform to puppet 4.x and later standards - https://phabricator.wikimedia.org/T181967#3903826 (10Ottomata) p:05Triage>03Normal [19:39:10] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Next): scap support for git-lfs - https://phabricator.wikimedia.org/T181855#3903847 (10Ottomata) p:05Triage>03Normal Just curious, why not use git fat? We have a git-fat store available already, and it can be used by... 
[19:39:19] 10Operations, 10Traffic, 10media-storage: "Error: 404, Requested domainname does not exist" when accessing Commons categories/images; works on mobile page - https://phabricator.wikimedia.org/T181801#3903849 (10Ottomata) p:05Triage>03Normal [19:39:30] 10Operations, 10Continuous-Integration-Infrastructure, 10HHVM: HHVM 3.18.5+dfsg-1+wmf3 changes parse_url causing unit tests to fail - https://phabricator.wikimedia.org/T185024#3903850 (10Ottomata) p:05Triage>03Normal [19:39:44] 10Operations, 10Page Content Service, 10RESTBase, 10Reading-Infrastructure-Team-Backlog, and 3 others: Inconsistent behavior when fetching redirected pages with Cache-Control header - https://phabricator.wikimedia.org/T184833#3903853 (10Ottomata) p:05Triage>03Normal [19:39:56] 10Operations, 10DBA, 10Release-Engineering-Team, 10cloud-services-team: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3903854 (10Ottomata) p:05Triage>03Normal [19:40:02] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review: Move mariadb_maintenance away from terbium/wasat (mediawiki_maintenance) - https://phabricator.wikimedia.org/T184797#3903855 (10Ottomata) p:05Triage>03Normal [19:41:43] 10Operations, 10ops-eqiad: Rack/cable/configure asw2-a/b/c-eqiad switch stack - https://phabricator.wikimedia.org/T183585#3903873 (10RobH) Should the frack (c1-eqiad) have an EX4300 placed in it, or is it excluded from the row switch stack? We ordered 14 EX4300, and the digram calls for 5 EX4300 per row, unle... 
[19:41:57] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3903877 (10atgo) And also nuria.last_access_uniques_daily_asiacell [19:42:15] (03PS2) 10Arlolra: parsoid::testing: convert role to profile [puppet] - 10https://gerrit.wikimedia.org/r/404063 (owner: 10Dzahn) [19:42:16] (03PS7) 10Arlolra: Switch to YAML configuration for Parsoid on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/403464 [19:42:27] 10Operations, 10monitoring: Puppet fail to properly refresh Icinga - https://phabricator.wikimedia.org/T184714#3903881 (10Ottomata) p:05Triage>03Normal I suppose a restart without a configcheck would be dangerous, right? So just changing the subscribe behavior of the puppet service isn't quite right. Sho... [19:42:44] 10Operations, 10Wikimedia-Logstash: logstash group1 dashboard incorrectly shows testwikidatawiki - https://phabricator.wikimedia.org/T184655#3903885 (10Ottomata) p:05Triage>03Normal [19:42:55] 10Operations, 10ops-esams: To purchase for next esams visit - https://phabricator.wikimedia.org/T184522#3903892 (10Ottomata) p:05Triage>03Normal [19:43:36] (03CR) 10Arlolra: "Thanks Dzahn, I rebased on your patch" [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra) [19:43:43] 10Operations, 10Puppet: Puppet hosts with their cert revoked can still run puppet - https://phabricator.wikimedia.org/T184444#3903899 (10Ottomata) p:05Triage>03High [19:44:01] 10Operations, 10Cloud-VPS, 10DNS, 10Traffic, 10Beta-Cluster-reproducible: Create some mechanism for instances in projects to modify the project Designate records - https://phabricator.wikimedia.org/T184245#3903900 (10Ottomata) p:05Triage>03Normal [19:44:04] (03CR) 10Arlolra: [C: 031] parsoid::testing: convert role to profile [puppet] - 10https://gerrit.wikimedia.org/r/404063 (owner: 10Dzahn) [19:45:04] (03CR) 10Dzahn: "thanks! you keep finding all my mistakes, i created this with sed -i .. 
i will amend and then what i'll do is just run it on one of the i" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) (owner: 10Dzahn) [19:45:05] no_justification: I'm going to start rolling forward wmf.16 now, FYI so I don't step on your toes. [19:45:13] Ok, you're fine [19:45:15] (03PS1) 10Madhuvishy: shinken: Point irc notify log file to wikimedia-cloud-feed [puppet] - 10https://gerrit.wikimedia.org/r/404530 [19:46:07] (03PS3) 10Dzahn: parsoid::testing: convert role to profile [puppet] - 10https://gerrit.wikimedia.org/r/404063 [19:46:17] (03CR) 10Madhuvishy: [C: 032] shinken: Point irc notify log file to wikimedia-cloud-feed [puppet] - 10https://gerrit.wikimedia.org/r/404530 (owner: 10Madhuvishy) [19:46:38] !log thcipriani@tin Synchronized php-1.31.0-wmf.16/includes/Storage/RevisionStore.php: [[gerrit:403930|RevisionStore, fix loadSlotContent with no $blobFlags]] T184749 (duration: 01m 13s) [19:46:43] (03CR) 10Dzahn: [C: 032] parsoid::testing: convert role to profile [puppet] - 10https://gerrit.wikimedia.org/r/404063 (owner: 10Dzahn) [19:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:54] T184749: Every edit (including rollback) distorts non-ASCII text - https://phabricator.wikimedia.org/T184749 [19:46:55] (03PS4) 10Dzahn: parsoid::testing: convert role to profile [puppet] - 10https://gerrit.wikimedia.org/r/404063 [19:46:58] !log rebooting labvirt1008 [19:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:12] (03CR) 10Dzahn: "Arlolra: cool, thanks! 
:) applied on ruthenium, noop, nothing happened" [puppet] - 10https://gerrit.wikimedia.org/r/404063 (owner: 10Dzahn) [19:50:30] (03PS8) 10Dzahn: Switch to YAML configuration for Parsoid on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra) [19:53:54] thcipriani: give me a ping when you push it out :) [19:54:00] (03CR) 10Dzahn: [C: 04-1] "i compiled it and it shows an error" [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra) [19:54:44] (03CR) 10Dzahn: [C: 04-1] "it's rebased on my patch now and that is merged, though" [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra) [19:55:48] addshore: I think it's possible to test this on mwdebug first, if I pull the wikiversions change over there and then run scap wikiversions-compile. Are you around to poke at that for a sec? [19:57:26] (03PS1) 10Thcipriani: Group1 to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404531 [20:00:04] thcipriani: Dear deployers, time to do the MediaWiki train deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180116T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:11] working on it [20:00:16] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Next): scap support for git-lfs - https://phabricator.wikimedia.org/T181855#3903991 (10demon) >>! In T181855#3903847, @Ottomata wrote: > Just curious, why not use git fat? We have a git-fat store available already, and it... 
[20:00:40] (03CR) 10Thcipriani: [C: 032] Group1 to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404531 (owner: 10Thcipriani) [20:01:21] thcipriani: yes if it lands on mwdebug i can test it [20:01:36] k, I'll put it there first :) [20:02:07] (03Merged) 10jenkins-bot: Group1 to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404531 (owner: 10Thcipriani) [20:02:29] (03PS9) 10Arlolra: Switch to YAML configuration for Parsoid on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/403464 [20:02:48] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Next): scap support for git-lfs - https://phabricator.wikimedia.org/T181855#3904000 (10Ottomata) K cool, sounds good :) [20:02:58] (03CR) 10Arlolra: "Gah, sorry, poor rebase on my part. Should be fixed now" [puppet] - 10https://gerrit.wikimedia.org/r/403464 (owner: 10Arlolra) [20:03:50] !log rebooting labvirt1009 [20:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:21] addshore: group1 on wmf.16 is live on mwdebug1002 [20:04:48] so, that includes svwiktionary, let me go and try it out [20:05:07] yep [20:05:17] thcipriani: looks good to me [20:05:24] https://sv.wiktionary.org/w/index.php?title=Anv%C3%A4ndare:Addshore/sandbox&diff=3087458&oldid=3087455 [20:05:40] nice, ok, going live everywhere [20:06:21] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell for cy534 - https://phabricator.wikimedia.org/T184473#3904015 (10Ottomata) Ah, the doc was incorrect, `analytics-users` gives access to both stat1004 and stat1005. Just updated the doc. Not sure what `nuria.las... 
[20:06:56] 10Operations, 10Puppet, 10Puppet-infrastructure-modernization: Fix unknown variables warning that occur with puppet 4.x - https://phabricator.wikimedia.org/T184186#3904020 (10Ottomata) p:05Triage>03Normal [20:07:12] 10Operations, 10Page Content Service, 10RESTBase, 10Reading-Infrastructure-Team-Backlog, and 3 others: Inconsistent behavior when fetching redirected pages with Cache-Control header - https://phabricator.wikimedia.org/T184833#3897945 (10Pchelolo) I've found another issue here: for mobile content, redirects... [20:07:15] 10Operations, 10Traffic, 10media-storage: Swift invalid range requests causing 501s - https://phabricator.wikimedia.org/T183902#3904023 (10Ottomata) p:05Triage>03Normal [20:07:25] !log thcipriani@tin rebuilt and synchronized wikiversions files: group1 to 1.31.0-wmf.16 [20:07:32] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Celery manager implodes horribly if Redis goes down - https://phabricator.wikimedia.org/T181632#3904031 (10Ottomata) p:05Triage>03Normal [20:07:43] ^ addshore should be live everywhere now, FYI [20:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:00] 10Operations, 10Scoring-platform-team, 10Wikimedia-Logstash, 10monitoring, 10Wikimedia-Incident: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630#3904032 (10Ottomata) p:05Triage>03Normal [20:08:18] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: What is causing ORES celery workers to suddenly require more CPU? - https://phabricator.wikimedia.org/T181621#3904033 (10Ottomata) p:05Triage>03Normal [20:08:33] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Investigate redis-cluster or other techniques for making Redis not a single point of failure. 
- https://phabricator.wikimedia.org/T181559#3904034 (10Ottomata) p:05Triage>03Normal [20:08:48] 10Operations, 10Scoring-platform-team: Let the ORES application set log severity, not uWSGI - https://phabricator.wikimedia.org/T181546#3904035 (10Ottomata) p:05Triage>03Normal [20:09:08] 10Operations, 10Quarry, 10cloud-services-team (Kanban): let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205#3904036 (10Ottomata) p:05Triage>03Normal [20:09:32] 10Operations, 10Analytics, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3904037 (10Ottomata) p:05Triage>03Normal [20:10:00] 10Operations, 10Scap: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#3904040 (10Ottomata) p:05Triage>03Normal [20:10:23] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review: Add CI to all operations/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180330#3904041 (10Ottomata) p:05Triage>03Normal [20:10:45] 10Operations, 10Release-Engineering-Team (Watching / External), 10User-Joe: [DRAFT][RfC] Deployment of python applications in production - https://phabricator.wikimedia.org/T180023#3904044 (10Ottomata) p:05Triage>03Normal [20:10:46] (03PS6) 10Dzahn: DHCP: switch from jessie to stretch as default installer [puppet] - 10https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) [20:10:54] 10Operations, 10media-storage: upload.wikimedia.org reports wrong mimetype for svg - https://phabricator.wikimedia.org/T179787#3904045 (10Ottomata) p:05Triage>03Normal [20:11:07] 10Operations, 10Puppet: Improve puppet alerting - https://phabricator.wikimedia.org/T178628#3904046 (10Ottomata) p:05Triage>03Normal [20:12:11] (03CR) 10jenkins-bot: Group1 to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404531 (owner: 10Thcipriani) [20:12:14] alright, so now that
wmf.16 is live on group1, will watch it for a bit and then move on to all wikis. [20:12:27] :) [20:12:47] thanks for all your help addshore ! [20:13:02] anytime :) [20:13:43] 10Operations, 10Discovery, 10Recommendation-API, 10Wikidata, and 2 others: flapping monitoring for recommendation_api on scb - https://phabricator.wikimedia.org/T178445#3904054 (10Ottomata) p:05Triage>03Normal [20:14:06] 10Operations, 10Cloud-VPS, 10netops: dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596#3904055 (10Ottomata) p:05Triage>03Low [20:14:47] 10Operations, 10monitoring, 10Graphite, 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482#3904056 (10Ottomata) p:05Triage>03Normal [20:15:08] (03CR) 10Dzahn: [C: 04-1] "tested on install2002, syntax error :)" [puppet] - 10https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) (owner: 10Dzahn) [20:15:16] 10Operations, 10monitoring, 10Graphite, 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482#3466440 (10Ottomata) BTW, +1 for this. It'd be especially cool if we applied the same puppet profile in labs and got the same grafana dashboards there.
[20:20:22] !log rebooting labvirt1010 [20:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:07] PROBLEM - Host ores.wmflabs.org is DOWN: CRITICAL - Host Unreachable (ores.wmflabs.org) [20:22:17] PROBLEM - Host paws.wmflabs.org is DOWN: CRITICAL - Host Unreachable (paws.wmflabs.org) [20:24:50] !log temporarily disabling puppet agents while troubleshooting puppet crl [20:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:39] !log demon@tin Started scap: wmf.17 files, no bootstrap of i18n tho [20:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:26] no_justification: did you get that patch that was backported onto .16 but not on master too? [20:29:18] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=61%) [20:29:37] (03PS7) 10Dzahn: DHCP: switch from jessie to stretch as default installer [puppet] - 10https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) [20:29:47] RECOVERY - Host ores.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [20:31:07] RECOVERY - Host paws.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [20:32:14] !log re-enabling puppet agents [20:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:02] !log rebooting labvirt1011 [20:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:39] !log demon@tin scap aborted: wmf.17 files, no bootstrap of i18n tho (duration: 08m 59s) [20:34:57] ^ On purpose, I made a mistake, thcipriani [20:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:28] okie doke. 
[20:35:56] !log demon@tin Started scap: wmf.17 files, no bootstrap of i18n tho (x2) [20:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:29] thcipriani: I deleted some files from under the rsync, it was complaining [20:36:47] ah [20:37:37] PROBLEM - Host labvirt1011 is DOWN: PING CRITICAL - Packet loss = 100% [20:38:57] RECOVERY - Host labvirt1011 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [20:42:30] !log demon@tin Finished scap: wmf.17 files, no bootstrap of i18n tho (x2) (duration: 06m 33s) [20:42:37] 6m33s! [20:42:39] Damn son [20:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:57] 10Operations, 10Puppet: Puppet hosts with their cert revoked can still run puppet - https://phabricator.wikimedia.org/T184444#3904134 (10herron) a:03herron After some further testing this combination of SSLCARevocation settings appears to work. ``` SSLCARevocationFile /var/lib/puppet/server/ssl/ca/c... [20:46:54] (03CR) 10Smalyshev: "Yes, for ease of debugging mostly. I'm ok with this change, with one addition - we need then to make the script output current timestamp t" [puppet] - 10https://gerrit.wikimedia.org/r/404315 (owner: 10Gehel) [20:47:41] !log rebooting labvirt1012 [20:50:14] thcipriani: Ok, so I'm all done. wmf.17 is everywhere but without l10n [20:50:24] So good for when you're ready to move forward [20:50:36] no_justification: cool, thanks! I'll go ahead and roll out wmf.16 to all wikis now. [20:50:58] PROBLEM - Host labvirt1012 is DOWN: PING CRITICAL - Packet loss = 100% [20:51:00] 10Operations, 10Page Content Service, 10RESTBase, 10Reading-Infrastructure-Team-Backlog, and 3 others: Inconsistent behavior when fetching redirected pages with Cache-Control header - https://phabricator.wikimedia.org/T184833#3904146 (10Pchelolo) Submitted a PR for RESTBase to fix inconsistencies on RB sid... 
[20:51:47] RECOVERY - Host labvirt1012 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [20:55:07] (03PS3) 10Ottomata: Add $monitoring_enabled parameter to cache::kafka::webrequest profile [puppet] - 10https://gerrit.wikimedia.org/r/403185 [20:56:37] (03CR) 10Ottomata: [C: 031] Allow to explicitly set the JAVA_HOME environment variable [puppet/cdh] - 10https://gerrit.wikimedia.org/r/403701 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [20:56:48] (03CR) 10Ottomata: [C: 032] Add $monitoring_enabled parameter to cache::kafka::webrequest profile [puppet] - 10https://gerrit.wikimedia.org/r/403185 (owner: 10Ottomata) [20:57:06] 10Operations, 10ops-eqiad: Rack/cable/configure asw2-a/b/c-eqiad switch stack - https://phabricator.wikimedia.org/T183585#3904155 (10ayounsi) Indeed, top diagram is for row A and B, bottom one for row C, which will not have a asw switch in the frack rack. [20:57:43] (03PS1) 10Thcipriani: all wikis to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404536 [20:59:12] !log rebooting labvirt1013 [20:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:29] (03CR) 10Thcipriani: [C: 032] all wikis to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404536 (owner: 10Thcipriani) [21:02:35] (03Merged) 10jenkins-bot: all wikis to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404536 (owner: 10Thcipriani) [21:02:51] (03CR) 10jenkins-bot: all wikis to 1.31.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404536 (owner: 10Thcipriani) [21:03:19] 10Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: SSDs for main Kafka clusters - https://phabricator.wikimedia.org/T166341#3904171 (10Ottomata) Do we still want to do this? 
[21:04:25] !log thcipriani@tin rebuilt and synchronized wikiversions files: all wikis to 1.31.0-wmf.16 [21:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:02] 10Operations, 10MediaWiki-Vagrant, 10Release-Engineering-Team, 10Epic, and 2 others: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#3904228 (10jgleeson) [21:15:15] !log rebooting labvirt1014 and 1015 [21:15:21] starting to see this one creep up in the logs a bit https://phabricator.wikimedia.org/T185037 not huge yet but worrisome. [21:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:46] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#3904264 (10ayounsi) About > Puppet is broken on netmon2001 because postgres is not installed Is because package installation is done in https://github.com/wikime... [21:29:18] (03PS1) 10Ottomata: Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) [21:29:45] (03CR) 10jerkins-bot: [V: 04-1] Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) (owner: 10Ottomata) [21:31:07] (03PS2) 10Ottomata: Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) [21:40:58] 10Operations, 10Continuous-Integration-Infrastructure, 10HHVM: HHVM 3.18.5+dfsg-1+wmf3 changes parse_url causing unit tests to fail - https://phabricator.wikimedia.org/T185024#3903647 (10Anomie) It looks like someone upstream misread RFC 3986 near [[https://github.com/facebook/hhvm/commit/80855dc1f2fe4d9de6b... 
[21:50:44] !log thcipriani@tin Started scap: testwiki to php-1.31.0-wmf.17 and rebuild l10n cache [21:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:38] (03PS3) 10Ottomata: Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) [21:56:41] I suspect wmf.17 will start producing "WikimediaMessages.hooks.php: Call to undefined method ORES\Hooks::isModelEnabled()" to the user when viewing Special:Preferences and possibly in other places. It breaks locally on master. Here's my patch for it: https://gerrit.wikimedia.org/r/#/c/404542/ [21:57:07] (03CR) 10jerkins-bot: [V: 04-1] Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) (owner: 10Ottomata) [21:57:26] (03PS4) 10Ottomata: Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) [21:57:48] (03CR) 10jerkins-bot: [V: 04-1] Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) (owner: 10Ottomata) [22:00:18] stephanebisson: thank you for that! I can get that backported to wmf.17 once I'm done with the l10n update for wmf.17. In the interim if you could get someone to merge it into master that'd be great :) [22:01:56] thcipriani: I'll try (pinging RoanKattouw, awight, Amir1 to review/merge https://gerrit.wikimedia.org/r/#/c/404542/) [22:02:42] +2ed [22:02:44] Also, yikes [22:04:14] thcipriani: is the train still ongoing? 
[22:04:14] thanks both [22:04:19] mobrovac: indeed :( [22:04:31] ah [22:04:33] kk [22:05:11] (03PS5) 10Ottomata: Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) [22:05:38] (03CR) 10jerkins-bot: [V: 04-1] Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) (owner: 10Ottomata) [22:06:38] (03PS6) 10Ottomata: Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) [22:06:42] (03CR) 10Dzahn: [C: 032] "tested on install2002 and now it is alright" [puppet] - 10https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) (owner: 10Dzahn) [22:06:48] (03PS8) 10Dzahn: DHCP: switch from jessie to stretch as default installer [puppet] - 10https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) [22:14:09] (03CR) 10Ppchelko: [C: 031] Ensure specific librdkafka version for changeprop and eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) (owner: 10Ottomata) [22:15:36] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/9752/scb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) (owner: 10Ottomata) [22:16:23] (03PS1) 10Herron: puppetmaster::ssl: fix crl file suffix [puppet] - 10https://gerrit.wikimedia.org/r/404587 (https://phabricator.wikimedia.org/T184444) [22:16:30] !log thcipriani@tin Finished scap: testwiki to php-1.31.0-wmf.17 and rebuild l10n cache (duration: 25m 45s) [22:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:09] (03PS1) 10Rush: cloud: labvirt settle on meltdown kernel [puppet] - 10https://gerrit.wikimedia.org/r/404588 
(https://phabricator.wikimedia.org/T184189) [22:19:00] !log apt-get install librdkafka1=0.9.4-1~jessie1 librdkafka++1=0.9.4-1~jessie1 on scb* to put librdkafka back at node-rdkafka compat version (somehow this was upgraded yesterday...very dangerous!!) [22:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:55] (03CR) 10Andrew Bogott: [C: 032] cloud: labvirt settle on meltdown kernel [puppet] - 10https://gerrit.wikimedia.org/r/404588 (https://phabricator.wikimedia.org/T184189) (owner: 10Rush) [22:19:55] !log thcipriani@tin Synchronized php-1.31.0-wmf.17/extensions/WikimediaMessages/WikimediaMessages.hooks.php: [[gerrit:404583|Update access to ORES isModelEnabled()]] (duration: 01m 13s) [22:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:15] ^ stephanebisson I got your backport merged and sync for wmf.17, thank you for that. [22:21:00] (03CR) 10Ppchelko: "Apparently, the packages were updated manually, so this might not be needed. Although it might be a good idea to pin it anyway to protect " [puppet] - 10https://gerrit.wikimedia.org/r/404540 (https://phabricator.wikimedia.org/T176126) (owner: 10Ottomata) [22:23:56] thcipriani: done? :P [22:25:03] are you waiting on me to do something? I've got to bump group0 to wmf.17 and run sync-wikiversions but I'm having a little trouble with it at the moment so it may take a few :\ [22:26:03] thcipriani: yeah, kind of, need to get https://gerrit.wikimedia.org/r/#/c/403704/ out [22:26:36] relates to jobrunners only [22:29:58] mobrovac: how long do you estimate that will take? It may take me a minute to figure out what's going wrong here. So if it's not going to take long you can jump in. 
[22:30:33] thcipriani: 10 mins or so i suppose, not more [22:30:50] 10Operations, 10Puppet, 10Patch-For-Review: Puppet hosts with their cert revoked can still run puppet - https://phabricator.wikimedia.org/T184444#3904520 (10herron) Did some further testing and have a fix that better integrates with our puppetization and works with `SSLCARevocationCheck chain` It looks li... [22:31:01] mobrovac: go for it. let me know when you're finished. It'll probably take me more than that to figure this out :) [22:31:06] kk thnx [22:33:47] (03PS2) 10Mobrovac: JobQueue: Use EventBus for HTMLCacheUpdate except en, commons, wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403703 (https://phabricator.wikimedia.org/T182023) [22:34:58] thcipriani: What's going on now? [22:35:22] no_justification: I think I figured it out. Adding closedwiki wikis to group0 freaked me out is all :) [22:35:46] (03CR) 10Mobrovac: [C: 032] JobQueue: Use EventBus for HTMLCacheUpdate except en, commons, wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403703 (https://phabricator.wikimedia.org/T182023) (owner: 10Mobrovac) [22:36:34] thcipriani: Ahhh, yeah that's a thing now! 
[22:36:44] it's definitely been a good long while since I last conducted :) [22:37:09] Think of me as the hobo riding in the boxcar :p [22:37:21] (03Merged) 10jenkins-bot: JobQueue: Use EventBus for HTMLCacheUpdate except en, commons, wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403703 (https://phabricator.wikimedia.org/T182023) (owner: 10Mobrovac) [22:38:20] (03CR) 10jenkins-bot: JobQueue: Use EventBus for HTMLCacheUpdate except en, commons, wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/403703 (https://phabricator.wikimedia.org/T182023) (owner: 10Mobrovac) [22:38:55] no_justification: I wish I could draw [22:39:24] !log ppchelko@tin Started deploy [cpjobqueue/deploy@19b9bdd]: Switch htmlCacheUpdates for all but en, commons, wikidata T182023 [22:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:39] T182023: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023 [22:39:46] greg-g: I can't draw either, it's k [22:39:58] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@19b9bdd]: Switch htmlCacheUpdates for all but en, commons, wikidata T182023 (duration: 00m 35s) [22:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:23] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Use EventBus for htmlCacheUpdate jobs for all wikis but en, commons and wikidata - T182023 (duration: 01m 12s) [22:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:13] (03PS1) 10Jgreen: uncomment ptr record for frav1001 [dns] - 10https://gerrit.wikimedia.org/r/404590 [22:42:42] (03CR) 10Jgreen: [C: 032] uncomment ptr record for frav1001 [dns] - 10https://gerrit.wikimedia.org/r/404590 (owner: 10Jgreen) [22:45:38] thcipriani: i'm {{done}} [22:45:45] mobrovac: thanks :) [22:46:06] thcipriani: thank you sir :) [22:46:13] * thcipriani doffs hat [22:48:43] (03PS1) 10EBernhardson: Remove cirrus AB test config for 
hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404592 (https://phabricator.wikimedia.org/T182616) [22:49:03] (03PS1) 10Thcipriani: Group0 to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404593 [22:50:42] (03CR) 10Thcipriani: [C: 032] Group0 to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404593 (owner: 10Thcipriani) [22:52:12] (03Merged) 10jenkins-bot: Group0 to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404593 (owner: 10Thcipriani) [22:52:24] (03CR) 10jenkins-bot: Group0 to 1.31.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404593 (owner: 10Thcipriani) [22:52:51] (03CR) 10Paladox: [C: 031] "> I don't see anything in the 2.14.7 log thats super important. We're" [software/gerrit] - 10https://gerrit.wikimedia.org/r/395820 (https://phabricator.wikimedia.org/T156120) (owner: 10Chad) [22:53:59] !log thcipriani@tin rebuilt and synchronized wikiversions files: group0 to 1.31.0-wmf.17 [22:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:42] /who [22:54:55] * cwd sorry [22:57:36] !log niharika29@tin Started deploy [scholarships/scholarships@728d203]: Update privacy statement and delete invalidated translation files. T184659 [22:57:38] !log niharika29@tin Finished deploy [scholarships/scholarships@728d203]: Update privacy statement and delete invalidated translation files. T184659 (duration: 00m 02s) [22:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:50] T184659: Update 2018 privacy statement - https://phabricator.wikimedia.org/T184659 [22:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:14] cwd: http://music.wookiefoot.com/track/intro-3 [23:01:27] greg-g: msp flashback :) [23:01:44] cwd: it was a targeted choice, indeed [23:07:52] (03PS1) 10Ppchelko: [JobQueue] Enable htmlCacheUpdate on new infrastructure for all projects. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/404598 (https://phabricator.wikimedia.org/T182023) [23:09:46] (03CR) 10jerkins-bot: [V: 04-1] [JobQueue] Enable htmlCacheUpdate on new infrastructure for all projects. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404598 (https://phabricator.wikimedia.org/T182023) (owner: 10Ppchelko) [23:12:39] (03PS2) 10Ppchelko: [JobQueue] Enable htmlCacheUpdate on new infrastructure for all projects. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404598 (https://phabricator.wikimedia.org/T182023) [23:17:35] PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:41:00] (03PS1) 10MaxSem: Add my new key [puppet] - 10https://gerrit.wikimedia.org/r/404604 [23:41:19] (03CR) 10jerkins-bot: [V: 04-1] Add my new key [puppet] - 10https://gerrit.wikimedia.org/r/404604 (owner: 10MaxSem) [23:42:35] RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [23:42:40] (03PS2) 10MaxSem: Add my new key [puppet] - 10https://gerrit.wikimedia.org/r/404604 [23:51:48] (03Abandoned) 10Dzahn: DHCP: switch all to http to serve installer [puppet] - 10https://gerrit.wikimedia.org/r/404054 (https://phabricator.wikimedia.org/T182215) (owner: 10Dzahn) [23:56:42] (03PS1) 10Dzahn: DHCP: switch to http to retrieve installer image [puppet] - 10https://gerrit.wikimedia.org/r/404607 (https://phabricator.wikimedia.org/T182215)