[00:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180222T0000).
[00:00:06] tgr: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:26] !log LDAP - added uid 'raz-shuty' to group 'wmde' (T187442)
[00:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:42] T187442: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442
[00:01:04] o/
[00:02:03] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3991473 (Dzahn) @RazShuty done! you have been added to the group. I think you have to log and log back in on Gerrit. Let us know if any issues.
[00:03:46] tgr: We just merged some emergency back-ports for an HHVM crasher, might slow things down.
[00:04:32] James_F: it's a config patch so shouldn't be affected too much
[00:09:43] I should be done shortl
[00:09:44] y
[00:10:04] oh, you mean you are still deploying?
[00:10:49] I'm pressing the launch button now
[00:12:14] !log demon@tin Synchronized php-1.31.0-wmf.21/includes/media/JpegMetadataExtractor.php: T184048 (duration: 01m 21s)
[00:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:27] T184048: HHVM hangs on the API cluster - https://phabricator.wikimedia.org/T184048
[00:12:39] stashbot's been slow lately :\
[00:12:39] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[00:12:42] no rush on my behalf, I just misunderstood James first and thoughts it's about zuul pileups
[00:13:38] !log demon@tin Synchronized php-1.31.0-wmf.22/includes/media/JpegMetadataExtractor.php: T184048 (duration: 01m 13s)
[00:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:38] tgr: I'm done
[00:15:53] thx
[00:16:10] I'll do a self-service SWAT then
[00:17:38] !log mholloway-shell@tin Started deploy [mobileapps/deploy@8ffb03b]: Update mobileapps to a1339a9
[00:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:28] (CR) Gergő Tisza: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/409638 (https://phabricator.wikimedia.org/T57420) (owner: Gergő Tisza)
[00:20:41] (Merged) jenkins-bot: Enable loginOnly mode for local auth provider on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/409638 (https://phabricator.wikimedia.org/T57420) (owner: Gergő Tisza)
[00:20:51] (CR) jenkins-bot: Enable loginOnly mode for local auth provider on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/409638 (https://phabricator.wikimedia.org/T57420) (owner: Gergő Tisza)
[00:22:18] Operations, ops-eqiad, netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#3991510 (ayounsi)
[00:23:42] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@8ffb03b]: Update mobileapps to a1339a9 (duration: 06m 05s)
[00:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:05] (PS1) Dzahn: introduce kafkamon1001/2001 [dns] - https://gerrit.wikimedia.org/r/413281 (https://phabricator.wikimedia.org/T187901)
[00:25:11] !log tgr@tin Synchronized wmf-config/CommonSettings-labs.php: T57420 enable loginOnly flag in beta (duration: 01m 12s)
[00:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:27] T57420: Remove local wiki password hash when
CentralAuth has attached account - https://phabricator.wikimedia.org/T57420
[00:27:19] paladox: wow, does Gerrit now remember the people i most commonly add as reviewers ... and suggests them in the drop down before you even search
[00:28:22] Operations, HHVM, MW-1.31-release-notes (WMF-deploy-2018-02-13 (1.31.0-wmf.21)), Performance-Team (Radar): HHVM hangs on the API cluster - https://phabricator.wikimedia.org/T184048#3991530 (Jdforrester-WMF) So… theoretically this should now be fixed, and we'll see fewer hung processes on the API...
[00:28:36] Operations, ops-eqiad, netops: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#3991533 (ayounsi)
[00:33:14] (PS1) Dzahn: partman: add kafkamon[1-2]00[0-9] [puppet] - https://gerrit.wikimedia.org/r/413283 (https://phabricator.wikimedia.org/T187901)
[00:38:33] (CR) Smalyshev: wdqs: allow configuration of kafka based updates (3 comments) [puppet] - https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: Gehel)
[00:40:51] !log smalyshev@tin Started deploy [wdqs/wdqs@5131080]: update whitelist to include categories namespace
[00:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:18] !log smalyshev@tin Finished deploy [wdqs/wdqs@5131080]: update whitelist to include categories namespace (duration: 00m 27s)
[00:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:22] !log smalyshev@tin Started deploy [wdqs/wdqs@5131080]: update whitelist to include categories namespace
[00:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:29] Operations, Deployments, Release-Engineering-Team (Kanban), User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3991581 (demon) So we think `--delete-excluded` will solve this.
The patch above adds s...
[00:44:40] Operations, Deployments, Release-Engineering-Team (Kanban), User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3991582 (demon) p:Triage>Low
[00:45:53] PROBLEM - Blazegraph process on wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (blazegraph), regex args ^java .* blazegraph-service-.*war
[00:46:13] PROBLEM - Blazegraph Port on wdqs1003 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused
[00:46:14] PROBLEM - WDQS HTTP Port on wdqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time
[00:46:21] SMalyshev: ^
[00:46:29] !log smalyshev@tin Finished deploy [wdqs/wdqs@5131080]: update whitelist to include categories namespace (duration: 03m 07s)
[00:46:37] ebernhardson: yeah I know. for some reason deployment is fubar
[00:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:48] I rolled it back
[00:46:53] RECOVERY - Blazegraph process on wdqs1003 is OK: PROCS OK: 1 process with UID = 997 (blazegraph), regex args ^java .* blazegraph-service-.*war
[00:47:13] RECOVERY - Blazegraph Port on wdqs1003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999
[00:47:14] RECOVERY - WDQS HTTP Port on wdqs1003 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.023 second response time
[00:48:03] weird thing the only thing I changed is a text file... well, will try to figure it out with gehel tomorrow
[00:57:41] (CR) BryanDavis: tools-static: Change to reverse proxy of cdnjs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: Bstorm)
[01:00:04] twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180222T0100).
[01:00:05] No GERRIT patches in the queue for this window AFAICS.
[01:01:10] !log Running cleanupBlocks.php on mediawikiwiki for T187834
[01:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:23] T187834: Username not recognized by the extension - https://phabricator.wikimedia.org/T187834
[01:05:48] !log Running cleanupBlocks.php on more wikis for T187834: alswiki bgwiki bhwiki cawiki dewiki elwiki eswiki frwiki hewiki hiwiki huwiki hywiki jawiki jawikibooks jawikinews jawikiquote jawikisource jawiktionary kawiki kowiki mswiki mswiktionary rowiki sourceswiki
[01:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:46] mutante: yes I think so
[01:12:05] paladox: nice feature, just noticed it the first time
[01:13:42] (PS2) Dzahn: introduce kafkamon1001/2001 [dns] - https://gerrit.wikimedia.org/r/413281 (https://phabricator.wikimedia.org/T187901)
[01:16:28] (PS3) Dzahn: introduce kafkamon1001/2001 [dns] - https://gerrit.wikimedia.org/r/413281 (https://phabricator.wikimedia.org/T187901)
[01:24:03] mutante: :)
[01:24:44] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 36.67% of data above the critical threshold [1800.0] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:37:57] (CR) Dereckson: [C: 1] Add romd.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/412896 (https://phabricator.wikimedia.org/T187184) (owner: Urbanecm)
[01:52:22] (CR) Bstorm: tools-static: Change to reverse proxy of cdnjs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: Bstorm)
[01:59:15] (PS5) Bstorm: tools-static: Change to reverse proxy of cdnjs [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604)
[02:17:03] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0]
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:24:22] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.21) (duration: 05m 53s)
[02:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:05:03] PROBLEM - puppet last run on restbase-dev1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:24:12] Operations, Domains, Traffic, WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3991730 (Volker_E) @Dzahn So it would be a request similar to @bmansurov's on RLP about cloning the corresponding GitHub repo? @...
[03:27:33] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 820.40 seconds
[04:01:34] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 209.67 seconds
[04:12:48] (PS2) Krinkle: Use $wgDBname instead of IDatabase::getDBname in feed config [mediawiki-config] - https://gerrit.wikimedia.org/r/412837
[04:13:46] (CR) BryanDavis: [C: 1] tools-static: Change to reverse proxy of cdnjs [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: Bstorm)
[05:03:16] (CR) Zhuyifei1999: [C: 1] tools-static: Change to reverse proxy of cdnjs [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: Bstorm)
[05:20:14] PROBLEM - HHVM jobrunner on mw1335 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 3.180 second response time
[05:21:13] RECOVERY - HHVM jobrunner on mw1335 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[06:21:22] (PS2) Marostegui: MariaDB: Setup db1115 and db2093 as new tendril databases [puppet] - https://gerrit.wikimedia.org/r/412678 (owner: Jcrespo)
[06:21:46] !log Stop
puppet and mysql on db1011 to get ready to copy its data to db1115 - T184704
[06:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:04] T184704: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704
[06:23:32] (CR) Marostegui: [C: 2] MariaDB: Setup db1115 and db2093 as new tendril databases [puppet] - https://gerrit.wikimedia.org/r/412678 (owner: Jcrespo)
[06:30:21] (CR) Marostegui: [V: 2 C: 2] MariaDB: Setup db1115 and db2093 as new tendril databases [puppet] - https://gerrit.wikimedia.org/r/412678 (owner: Jcrespo)
[06:30:54] what's going on with gerrit? why can't I merge patches? :|
[06:33:07] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/413301
[06:33:11] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/413301
[06:33:27] let's see if the same thing happens with mediawiki-config branches…
[06:34:53] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/413301 (owner: Marostegui)
[06:36:26] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/413301 (owner: Marostegui)
[06:36:33] PROBLEM - Nginx local proxy to apache on mw2129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:36:56] right, so merges can happen with mediawiki but not with puppet?
[06:37:00] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/413301 (owner: Marostegui)
[06:37:23] RECOVERY - Nginx local proxy to apache on mw2129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.208 second response time
[06:38:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1105 (duration: 01m 13s)
[06:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:59] (Abandoned) Elukey: role::archiva: move to java 8 [puppet] - https://gerrit.wikimedia.org/r/410445 (https://phabricator.wikimedia.org/T166248) (owner: Elukey)
[06:40:31] (Abandoned) Elukey: Fix and tune the new Analytics Hadoop alarms [puppet] - https://gerrit.wikimedia.org/r/337574 (https://phabricator.wikimedia.org/T88640) (owner: Elukey)
[06:44:10] (CR) Elukey: [C: 1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/413281 (https://phabricator.wikimedia.org/T187901) (owner: Dzahn)
[06:44:18] (PS1) Marostegui: tendril.pp: Fix typo [puppet] - https://gerrit.wikimedia.org/r/413302 (https://phabricator.wikimedia.org/T184704)
[06:45:47] (CR) Marostegui: [C: 2] tendril.pp: Fix typo [puppet] - https://gerrit.wikimedia.org/r/413302 (https://phabricator.wikimedia.org/T184704) (owner: Marostegui)
[06:45:57] (CR) Elukey: partman: add kafkamon[1-2]00[0-9] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/413283 (https://phabricator.wikimedia.org/T187901) (owner: Dzahn)
[06:47:23] (PS3) Elukey: role::analytics_cluster::coordinator: enable mon.
for oozie|hive [puppet] - https://gerrit.wikimedia.org/r/413189 (https://phabricator.wikimedia.org/T184794)
[06:53:26] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/10076/analytics1003.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/413189 (https://phabricator.wikimedia.org/T184794) (owner: Elukey)
[06:57:53] no_justification: Hello Sir, Are you here?
[07:00:25] addshore, Antoine (hashar), Brad (anomie), Katie (aude), Max (MaxSem), Mukunda (twentyafterfour), Roan (RoanKattouw), Sébastien (Dereckson), Tyler (thcipriani), Niharika (Niharika), or Željko (zeljkof) : Hello
[07:00:35] Please See https://phabricator.wikimedia.org/T185977#3991907
[07:01:35] Jayprakash12345: OK I'll run that
[07:01:56] RoanKattouw: Thanks
[07:02:55] Jayprakash12345: Done
[07:06:18] (PS1) Marostegui: Revert "tendril.pp: Fix typo" [puppet] - https://gerrit.wikimedia.org/r/413304
[07:06:30] (PS2) Marostegui: Revert "tendril.pp: Fix typo" [puppet] - https://gerrit.wikimedia.org/r/413304
[07:07:02] (CR) Marostegui: [C: 2] Revert "tendril.pp: Fix typo" [puppet] - https://gerrit.wikimedia.org/r/413304 (owner: Marostegui)
[07:07:21] (PS1) Marostegui: Revert "MariaDB: Setup db1115 and db2093 as new tendril databases" [puppet] - https://gerrit.wikimedia.org/r/413305
[07:07:26] (PS2) Marostegui: Revert "MariaDB: Setup db1115 and db2093 as new tendril databases" [puppet] - https://gerrit.wikimedia.org/r/413305
[07:07:53] (CR) jerkins-bot: [V: -1] Revert "MariaDB: Setup db1115 and db2093 as new tendril databases" [puppet] - https://gerrit.wikimedia.org/r/413305 (owner: Marostegui)
[07:08:23] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 503 (expecting: 200)
[07:09:14] RECOVERY - Restbase edge esams on
text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[07:10:31] (CR) Marostegui: [V: 2 C: 2] Revert "MariaDB: Setup db1115 and db2093 as new tendril databases" [puppet] - https://gerrit.wikimedia.org/r/413305 (owner: Marostegui)
[07:19:23] (PS1) Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413306 (https://phabricator.wikimedia.org/T187089)
[07:21:18] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413306 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[07:22:46] (Merged) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413306 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[07:22:57] (CR) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413306 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[07:24:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1076 for alter table (duration: 01m 14s)
[07:24:33] !log Stop MySQL on db1076 for mariadb and kernel upgrade + alter table
[07:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:19] * elukey saw a marostegui's alter table msg and can start the day
[07:25:26] XDDDDDDD
[07:25:41] morning to yout oo
[07:27:30] <3
[07:32:04] !log Deploy schema change on db1076 - T187089 T185128 T153182
[07:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:19] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[07:32:19] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[07:32:19] T185128: Schema change to prepare for dropping archive.ar_text
and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[07:32:41] <_joe_> !log starting tests on mwdebug1001 again
[07:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:57] (CR) Krinkle: [C: 2] Use $wgDBname instead of IDatabase::getDBname in feed config [mediawiki-config] - https://gerrit.wikimedia.org/r/412837 (owner: Krinkle)
[07:41:29] (Merged) jenkins-bot: Use $wgDBname instead of IDatabase::getDBname in feed config [mediawiki-config] - https://gerrit.wikimedia.org/r/412837 (owner: Krinkle)
[07:41:42] (CR) jenkins-bot: Use $wgDBname instead of IDatabase::getDBname in feed config [mediawiki-config] - https://gerrit.wikimedia.org/r/412837 (owner: Krinkle)
[07:43:41] (PS1) Jcrespo: Revert "Revert "MariaDB: Setup db1115 and db2093 as new tendril databases"" [puppet] - https://gerrit.wikimedia.org/r/413309
[07:44:50] Testing ^ patch on mwdebug1002
[07:47:35] <_joe_> Krinkle: ok, my tests on mwdebug1001 are non-disruptive for now, btw
[07:47:44] _joe_: syncing,
[07:47:50] (PS2) Jcrespo: Revert "Revert "MariaDB: Setup db1115 and db2093 as new tendril databases"" [puppet] - https://gerrit.wikimedia.org/r/413309
[07:47:52] <_joe_> so you can use it as well
[07:48:13] _joe_: does your test involve changes in /srv/mediawiki ?
[07:48:23] <_joe_> nope
[07:48:27] k
[07:48:28] !log krinkle@tin Synchronized wmf-config/FeaturedFeedsWMF.php: I73945d7d - minor clean-up (duration: 01m 13s)
[07:48:33] <_joe_> just iptables and profanity :P
[07:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:47] <_joe_> for now at least, later they can include nc or similar thing
[07:48:50] <_joe_> *things
[07:50:49] (PS1) Muehlenhoff: Extend access for pnorman [puppet] - https://gerrit.wikimedia.org/r/413310
[07:52:16] (CR) Muehlenhoff: [C: 2] Extend access for pnorman [puppet] - https://gerrit.wikimedia.org/r/413310 (owner: Muehlenhoff)
[07:53:00] (PS3) Jcrespo: Revert "Revert "MariaDB: Setup db1115 and db2093 as new tendril databases"" [puppet] - https://gerrit.wikimedia.org/r/413309
[08:05:23] PROBLEM - Host dubnium is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:24] PROBLEM - Host logstash1007 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:24] PROBLEM - Host install1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:24] PROBLEM - Host chlorine is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:28] !log Disable puppet on db1011 - T184704
[08:05:33] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:41] T184704: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704
[08:05:54] PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:54] PROBLEM - Host rutherfordium is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:54] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:23] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:33] PROBLEM - Host hassium is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:43] PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:54]
PROBLEM - Host netmon1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:07:24] PROBLEM - SSH on ganeti1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:08:01] (PS4) Jcrespo: Revert "Revert "MariaDB: Setup db1115 and db2093 as new tendril databases"" [puppet] - https://gerrit.wikimedia.org/r/413309
[08:08:39] (PS8) Gehel: elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - https://gerrit.wikimedia.org/r/412670
[08:08:55] (CR) Jcrespo: [C: 2] Revert "Revert "MariaDB: Setup db1115 and db2093 as new tendril databases"" [puppet] - https://gerrit.wikimedia.org/r/413309 (owner: Jcrespo)
[08:09:47] (PS9) Gehel: elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - https://gerrit.wikimedia.org/r/412670
[08:11:33] RECOVERY - SSH on ganeti1006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[08:11:44] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:12:15] this is bohrium --^
[08:13:19] (CR) Gehel: [C: 2] elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - https://gerrit.wikimedia.org/r/412670 (owner: Gehel)
[08:15:53] RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[08:15:53] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms
[08:15:53] RECOVERY - Host dubnium is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[08:15:53] RECOVERY - Host chlorine is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms
[08:15:54] RECOVERY - Host logstash1007 is UP: PING OK - Packet loss = 16%, RTA = 0.37 ms
[08:16:03] RECOVERY - Host rutherfordium is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[08:16:13] RECOVERY - Host hassium is UP: PING OK - Packet
loss = 0%, RTA = 0.96 ms
[08:16:13] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms
[08:16:23] RECOVERY - Host netmon1003 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[08:16:23] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[08:16:23] RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms
[08:17:04] PROBLEM - Check systemd state on relforge1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:17:34] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-wmf-elasticsearch-exporter]
[08:17:53] RECOVERY - Host install1002 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[08:18:43] (PS1) Gehel: elasticsearch: fix typo in prometheus-wmf-exporter systemd unit [puppet] - https://gerrit.wikimedia.org/r/413312
[08:19:19] (CR) Gehel: [C: 2] elasticsearch: fix typo in prometheus-wmf-exporter systemd unit [puppet] - https://gerrit.wikimedia.org/r/413312 (owner: Gehel)
[08:20:53] (CR) Ema: [C: 2] icinga: promote check_established_connections alerts to critical [puppet] - https://gerrit.wikimedia.org/r/413208 (https://phabricator.wikimedia.org/T170847) (owner: Ema)
[08:20:59] (PS2) Ema: icinga: promote check_established_connections alerts to critical [puppet] - https://gerrit.wikimedia.org/r/413208 (https://phabricator.wikimedia.org/T170847)
[08:21:02] (CR) Ema: [V: 2 C: 2] icinga: promote check_established_connections alerts to critical [puppet] - https://gerrit.wikimedia.org/r/413208 (https://phabricator.wikimedia.org/T170847) (owner: Ema)
[08:22:33] RECOVERY - puppet last run on relforge1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:23:13] (PS1) Gehel: elasticsearch: fix source instead of content for prometheus-wmf-exporter [puppet] -
https://gerrit.wikimedia.org/r/413313
[08:24:05] !log ulsfo LVSs: upgrade pybal to 1.14.4
[08:24:15] (PS2) Gehel: elasticsearch: fix source instead of content for prometheus-wmf-exporter [puppet] - https://gerrit.wikimedia.org/r/413313
[08:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:48] (CR) Gehel: [C: 2] elasticsearch: fix source instead of content for prometheus-wmf-exporter [puppet] - https://gerrit.wikimedia.org/r/413313 (owner: Gehel)
[08:27:53] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:29:03] PROBLEM - Check systemd state on relforge1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:29:23] PROBLEM - Check systemd state on elastic2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:30:22] PROBLEM - Check systemd state on elastic2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:31:02] PROBLEM - Check systemd state on elastic1048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:31:02] PROBLEM - Check systemd state on elastic2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:31:04] PROBLEM - Check systemd state on elastic1043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:31:23] PROBLEM - Check systemd state on elastic1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:31:32] PROBLEM - Check systemd state on elastic2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:31:43] PROBLEM - Check systemd state on elastic1026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:32:02] !log esams LVSs: upgrade pybal to 1.14.4
[08:32:12] PROBLEM - Check systemd state on elastic2035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:32:12] PROBLEM - Check systemd state on elastic2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:32] PROBLEM - Check systemd state on elastic2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:32:40] ^ elasticsearch is me, fix coming up
[08:32:57] (PS1) Gehel: elasticsearch: fix ferm rule for prometheus-wmf-elasticsearch-exporter [puppet] - https://gerrit.wikimedia.org/r/413314
[08:33:45] (CR) Gehel: [C: 2] elasticsearch: fix ferm rule for prometheus-wmf-elasticsearch-exporter [puppet] - https://gerrit.wikimedia.org/r/413314 (owner: Gehel)
[08:33:52] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0
[08:35:34] !log Stop tendril database (db1011) to copy it to db1115 - tendril will be offline while the copy is in progress - T184704
[08:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:49] T184704: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704
[08:36:31] overall, not a good start of day
[08:36:36] * gehel blames lack of coffee
[08:45:02] PROBLEM - Oozie Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap
[08:45:19] this is me --^
[08:47:02] RECOVERY - Oozie Server on analytics1003 is OK: PROCS OK: 1 process with command name java, args
org.apache.catalina.startup.Bootstrap
[08:47:30] !log codfw LVSs: upgrade pybal to 1.14.4
[08:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:24] !log tendril and dbtree database currently under maintanance
[08:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:39] I think people know more about dbtree than the other
[08:48:57] hehe could be - good point
[08:49:22] RECOVERY - Check systemd state on relforge1001 is OK: OK - running: The system is fully operational
[08:49:30] (PS1) Krinkle: profiler-labs: Call xhprof_enable earlier to match prod [mediawiki-config] - https://gerrit.wikimedia.org/r/413315
[08:49:54] (PS2) Krinkle: profiler-labs: Call xhprof_enable earlier to match prod [mediawiki-config] - https://gerrit.wikimedia.org/r/413315 (https://phabricator.wikimedia.org/T180183)
[08:50:03] (CR) Krinkle: [C: 2] profiler-labs: Call xhprof_enable earlier to match prod [mediawiki-config] - https://gerrit.wikimedia.org/r/413315 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[08:51:35] (Merged) jenkins-bot: profiler-labs: Call xhprof_enable earlier to match prod [mediawiki-config] - https://gerrit.wikimedia.org/r/413315 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[08:51:50] (CR) jenkins-bot: profiler-labs: Call xhprof_enable earlier to match prod [mediawiki-config] - https://gerrit.wikimedia.org/r/413315 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[08:54:12] RECOVERY - Check systemd state on relforge1002 is OK: OK - running: The system is fully operational
[08:56:42] (PS1) Elukey: role::analytics_cluster::coordinator: rollback oozie prometheus config [puppet] - https://gerrit.wikimedia.org/r/413316 (https://phabricator.wikimedia.org/T184794)
[08:57:25] (CR) Elukey: [C: 2] role::analytics_cluster::coordinator: rollback oozie prometheus config [puppet] - https://gerrit.wikimedia.org/r/413316
(https://phabricator.wikimedia.org/T184794) (owner: 10Elukey) [08:57:32] RECOVERY - Check systemd state on elastic2017 is OK: OK - running: The system is fully operational [08:58:32] RECOVERY - Check systemd state on elastic2030 is OK: OK - running: The system is fully operational [08:58:42] RECOVERY - Check systemd state on elastic2018 is OK: OK - running: The system is fully operational [08:59:13] RECOVERY - Check systemd state on elastic1043 is OK: OK - running: The system is fully operational [08:59:13] RECOVERY - Check systemd state on elastic2007 is OK: OK - running: The system is fully operational [08:59:15] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765#3992080 (10ema) [08:59:18] 10Operations, 10Pybal, 10Traffic, 10monitoring, 10Patch-For-Review: Icinga check for pybal HTTP connections to etcd - https://phabricator.wikimedia.org/T170847#3992078 (10ema) 05Open>03Resolved a:03ema [08:59:22] RECOVERY - Check systemd state on elastic2035 is OK: OK - running: The system is fully operational [08:59:22] RECOVERY - Check systemd state on elastic2001 is OK: OK - running: The system is fully operational [08:59:32] RECOVERY - Check systemd state on elastic1018 is OK: OK - running: The system is fully operational [08:59:52] RECOVERY - Check systemd state on elastic1026 is OK: OK - running: The system is fully operational [09:00:12] RECOVERY - Check systemd state on elastic1048 is OK: OK - running: The system is fully operational [09:00:42] RECOVERY - Check systemd state on elastic2011 is OK: OK - running: The system is fully operational [09:02:07] (03PS1) 10Marostegui: wmnet: Change tendril backend to db1115 [dns] - 10https://gerrit.wikimedia.org/r/413317 (https://phabricator.wikimedia.org/T184704) [09:02:30] (03CR) 10Marostegui: [C: 04-1] "Wait for the data to be transferred" [dns] - 10https://gerrit.wikimedia.org/r/413317 (https://phabricator.wikimedia.org/T184704) (owner: 
10Marostegui) [09:03:02] !log eqiad LVSs: upgrade pybal to 1.14.4 [09:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:48] 10Operations, 10ops-codfw, 10DBA: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#3992082 (10Marostegui) [09:05:59] 10Operations, 10ops-codfw, 10DBA: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#3992094 (10Marostegui) p:05Triage>03Normal [09:12:55] (03PS1) 10Elukey: role::prometheus::analytics: add Hive Prometheus poll config [puppet] - 10https://gerrit.wikimedia.org/r/413320 (https://phabricator.wikimedia.org/T184794) [09:13:29] (03CR) 10Elukey: [C: 032] role::prometheus::analytics: add Hive Prometheus poll config [puppet] - 10https://gerrit.wikimedia.org/r/413320 (https://phabricator.wikimedia.org/T184794) (owner: 10Elukey) [09:14:18] (03PS6) 10Jcrespo: Add Proxysql creation debian package script [software] - 10https://gerrit.wikimedia.org/r/404153 [09:14:20] (03PS1) 10Jcrespo: dblists: Updates for db2044 movement [software] - 10https://gerrit.wikimedia.org/r/413321 (https://phabricator.wikimedia.org/T187886) [09:14:47] (03CR) 10Jcrespo: [V: 032 C: 032] dblists: Updates for db2044 movement [software] - 10https://gerrit.wikimedia.org/r/413321 (https://phabricator.wikimedia.org/T187886) (owner: 10Jcrespo) [09:14:51] (03PS2) 10Jcrespo: dblists: Updates for db2044 movement [software] - 10https://gerrit.wikimedia.org/r/413321 (https://phabricator.wikimedia.org/T187886) [09:14:54] (03CR) 10Jcrespo: [V: 032 C: 032] dblists: Updates for db2044 movement [software] - 10https://gerrit.wikimedia.org/r/413321 (https://phabricator.wikimedia.org/T187886) (owner: 10Jcrespo) [09:15:16] (03PS1) 10Marostegui: db-eqiad.php: Depool db11104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413322 (https://phabricator.wikimedia.org/T186321) [09:16:52] (03PS1) 10Marostegui: db1104: Switch binlog to STATEMENT [puppet] - 
10https://gerrit.wikimedia.org/r/413323 (https://phabricator.wikimedia.org/T186321) [09:16:59] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db11104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413322 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [09:18:10] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db11104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413322 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [09:18:21] (03CR) 10jenkins-bot: db-eqiad.php: Depool db11104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413322 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [09:19:46] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984#3992125 (10Scoopfinder) p:05Triage>03Normal [09:19:49] !log rebooting multatuli [09:19:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1104 - T186321 (duration: 01m 13s) [09:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:14] !log Stop MySQL on db1104 to switch its binlog to statement - T186321 [09:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:17] (03CR) 10Marostegui: [C: 032] db1104: Switch binlog to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/413323 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [09:20:17] T186321: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321 [09:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:33] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0 [09:21:42] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 73, down: 1, dormant: 0, excluded: 0, 
unused: 0 [09:23:44] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984#3992135 (10Scoopfinder) [09:26:13] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413324 [09:26:52] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1949 bytes in 0.165 second response time [09:28:19] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413324 (owner: 10Marostegui) [09:29:58] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413324 (owner: 10Marostegui) [09:30:12] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413324 (owner: 10Marostegui) [09:31:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1104 - T186321 (duration: 01m 12s) [09:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:42] T186321: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321 [09:31:53] check_wikidata seems ok to me [09:31:53] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1943 bytes in 0.104 second response time [09:37:10] (03CR) 10Jcrespo: [C: 031] wmnet: Change tendril backend to db1115 [dns] - 10https://gerrit.wikimedia.org/r/413317 (https://phabricator.wikimedia.org/T184704) (owner: 10Marostegui) [09:37:20] (03CR) 10Gehel: wdqs: allow configuration of kafka based updates (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: 10Gehel) [09:37:47] (03PS8) 10Gehel: wdqs: allow configuration of kafka based 
updates [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) [09:41:54] (03CR) 10Filippo Giunchedi: [C: 031] Stop routing Varnish thumb.php traffic to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/413185 (https://phabricator.wikimedia.org/T187899) (owner: 10Gilles) [09:42:11] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413327 [09:44:42] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413327 (owner: 10Marostegui) [09:46:11] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413327 (owner: 10Marostegui) [09:46:52] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413327 (owner: 10Marostegui) [09:47:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1104 - T186321 (duration: 01m 12s) [09:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:57] T186321: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321 [09:53:18] (03PS1) 10Jcrespo: install_server: Set db2090, db2073 to reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/413328 (https://phabricator.wikimedia.org/T170662) [09:55:48] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1089, depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413329 (https://phabricator.wikimedia.org/T162807) [09:57:56] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1089, depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413329 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:59:19] !log reboot kraz.wikimedia.org (irc.wikimedia.org) [09:59:20] (03Merged) 10jenkins-bot: db-eqiad.php: 
Slowly repool db1089, depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413329 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:59:30] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1089, depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413329 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:38] (03PS1) 10Muehlenhoff: Bump meta package for new ABI in 4.9 (caused by retpoline changes) [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/413330 [10:00:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 with low traffic and depool db1067 - T162807 (duration: 01m 12s) [10:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:06] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [10:05:24] (03CR) 10Muehlenhoff: [C: 032] Bump meta package for new ABI in 4.9 (caused by retpoline changes) [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/413330 (owner: 10Muehlenhoff) [10:05:45] (03CR) 10Jcrespo: [C: 032] install_server: Set db2090, db2073 to reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/413328 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [10:05:50] (03PS2) 10Jcrespo: install_server: Set db2090, db2073 to reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/413328 (https://phabricator.wikimedia.org/T170662) [10:06:27] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413331 [10:07:09] (03PS1) 10Jcrespo: mariadb: Depool db2073 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413332 (https://phabricator.wikimedia.org/T170662) [10:07:53] jynus: you go first :) [10:08:41] !log uploaded Linux 4.9.82-1~wmf1 for jessie-wikimedia to apt.wikimedia.org (retpoline-enabled kernel) 
[10:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:39] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2073 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413332 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [10:10:57] (03CR) 10jenkins-bot: mariadb: Depool db2073 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413332 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [10:11:45] db1064.eqiad.wmnet is twice on puppet [10:12:58] true [10:13:01] I see it [10:13:57] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2073 for maintenance (duration: 01m 12s) [10:14:01] (03PS1) 10Ema: cache_text: upgrade ulsfo to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/413333 (https://phabricator.wikimedia.org/T184448) [10:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:16] (03PS1) 10Marostegui: site.pp: Remove duplicated db1064 [puppet] - 10https://gerrit.wikimedia.org/r/413334 [10:14:31] jynus: ^ [10:15:03] (03PS1) 10Jcrespo: mariadb: Add db2090 to s4 section [puppet] - 10https://gerrit.wikimedia.org/r/413335 (https://phabricator.wikimedia.org/T170662) [10:15:15] <_joe_> bbiab [10:15:48] (03CR) 10Ema: [C: 032] cache_text: upgrade ulsfo to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/413333 (https://phabricator.wikimedia.org/T184448) (owner: 10Ema) [10:16:02] I had created mine already [10:16:26] Ah, cool merge it (the commit says db2064 btw) [10:16:29] I will abandon mine [10:16:44] I almost prefer not to separate it on puppet [10:16:45] 10Operations, 10monitoring: Upgrade to Prometheus 2.x - https://phabricator.wikimedia.org/T187987#3992211 (10fgiunchedi) [10:16:53] sure [10:16:57] only on mediawiki [10:17:12] (03Abandoned) 10Marostegui: site.pp: Remove duplicated db1064 [puppet] - 10https://gerrit.wikimedia.org/r/413334 (owner: 10Marostegui) [10:17:22] I don't think we do it for others, but correct me 
if I am wrong [10:17:31] no no, that's fine [10:18:05] !log upgrade cache_text @ ulsfo to varnish 5 [10:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413331 (owner: 10Marostegui) [10:19:20] (03PS2) 10Marostegui: db-eqiad.php: Increase traffic for db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413331 [10:20:50] (03PS2) 10Jcrespo: mariadb: Add db2090 to s4 section [puppet] - 10https://gerrit.wikimedia.org/r/413335 (https://phabricator.wikimedia.org/T170662) [10:21:04] (03PS3) 10Jcrespo: mariadb: Add db2090 to s4 section [puppet] - 10https://gerrit.wikimedia.org/r/413335 (https://phabricator.wikimedia.org/T170662) [10:21:20] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413331 (owner: 10Marostegui) [10:21:41] (03CR) 10Marostegui: [C: 031] mariadb: Add db2090 to s4 section [puppet] - 10https://gerrit.wikimedia.org/r/413335 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [10:22:02] (03CR) 10Jcrespo: [C: 032] mariadb: Add db2090 to s4 section [puppet] - 10https://gerrit.wikimedia.org/r/413335 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [10:22:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1104 - T186321 (duration: 01m 14s) [10:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:45] T186321: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321 [10:24:42] 10Operations, 10Patch-For-Review, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Provide authenticated access to Prometheus native web interface - https://phabricator.wikimedia.org/T151009#3992242 (10fgiunchedi) [10:25:00] 10Operations, 10monitoring, 10Patch-For-Review, 
10Prometheus-metrics-monitoring, 10User-fgiunchedi: Provide authenticated access to Prometheus native web interface - https://phabricator.wikimedia.org/T151009#2804471 (10fgiunchedi) [10:28:59] (03PS1) 10Jcrespo: mariadb: disable alerts on db2073, db2090 and assign them to s4 [puppet] - 10https://gerrit.wikimedia.org/r/413337 (https://phabricator.wikimedia.org/T170662) [10:30:06] (03CR) 10Jcrespo: [C: 032] mariadb: disable alerts on db2073, db2090 and assign them to s4 [puppet] - 10https://gerrit.wikimedia.org/r/413337 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [10:31:20] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413338 [10:34:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413338 (owner: 10Marostegui) [10:35:56] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984#3992266 (10Aklapper) > The last update of OTRS has been made in through T74109 Last major version update (to version 5) yes; latest minor update task was https://phabricator.wikim... 
[10:36:11] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413338 (owner: 10Marostegui) [10:37:02] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413338 (owner: 10Marostegui) [10:37:46] <_joe_> !log benchmarking EtcdConfig failure scenarios on mwdebug1001, T185078 [10:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:59] T185078: Test EtcdConfig in different failure scenarios - https://phabricator.wikimedia.org/T185078 [10:38:13] 10Operations, 10MediaWiki-Configuration, 10User-Joe, 10discovery-system: Test EtcdConfig in different failure scenarios - https://phabricator.wikimedia.org/T185078#3992272 (10Joe) a:03Joe [10:38:47] 10Operations, 10MediaWiki-Configuration, 10Patch-For-Review, 10User-Joe, 10discovery-system: Prepare conftool for safely editing mediawiki-config values - https://phabricator.wikimedia.org/T185080#3992276 (10Joe) a:03Joe [10:40:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1089 and fully repool db1104 (duration: 01m 13s) [10:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:46] !log stop db2073 for maintenance [10:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:28] (03PS1) 10Marostegui: install_server: Reimage db1115 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/413342 (https://phabricator.wikimedia.org/T184704) [10:57:14] (03PS2) 10Marostegui: install_server: Reimage db1115 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/413342 (https://phabricator.wikimedia.org/T184704) [10:59:19] (03PS3) 10Marostegui: install_server: Reimage db1115 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/413342 (https://phabricator.wikimedia.org/T184704) [11:04:38] (03PS10) 10Alexandros Kosiaris: Remove ORES profile from scb [puppet] - 
10https://gerrit.wikimedia.org/r/408560 (https://phabricator.wikimedia.org/T171851) [11:05:06] (03PS1) 10Urbanecm: New throttle rule for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413345 (https://phabricator.wikimedia.org/T187990) [11:07:50] (03CR) 10Alexandros Kosiaris: [C: 032] Remove ORES profile from scb [puppet] - 10https://gerrit.wikimedia.org/r/408560 (https://phabricator.wikimedia.org/T171851) (owner: 10Alexandros Kosiaris) [11:13:49] (03PS1) 10Arturo Borrero Gonzalez: toollabs: apt_pinning: fix typo in nginx-* version pinning for jessie [puppet] - 10https://gerrit.wikimedia.org/r/413348 (https://phabricator.wikimedia.org/T181647) [11:14:58] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toollabs: apt_pinning: fix typo in nginx-* version pinning for jessie [puppet] - 10https://gerrit.wikimedia.org/r/413348 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [11:15:06] (03PS2) 10Arturo Borrero Gonzalez: toollabs: apt_pinning: fix typo in nginx-* version pinning for jessie [puppet] - 10https://gerrit.wikimedia.org/r/413348 (https://phabricator.wikimedia.org/T181647) [11:16:06] 10Operations, 10media-storage: Have swift metrics available in Prometheus - https://phabricator.wikimedia.org/T187991#3992400 (10fgiunchedi) [11:21:37] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:23:17] (03CR) 10Filippo Giunchedi: [C: 031] "wrt to swift no changes needed for end user traffic" [puppet] - 10https://gerrit.wikimedia.org/r/413236 (https://phabricator.wikimedia.org/T187930) (owner: 10Brion VIBBER) [11:25:26] cool, i'll resubmit without the WIP marker :D [11:26:11] brion: \o/ [11:26:43] i'm on a godawful guest network atm tho, may be a minute [11:27:26] under 1Mbit :P [11:32:04] ugh, also horrid latency [11:32:09] (03PS2) 10Brion VIBBER: gzip .stl files on transfer (application/sla) [puppet] - 10https://gerrit.wikimedia.org/r/413236 (https://phabricator.wikimedia.org/T187930) [11:32:38] 500-600ms from london to eqiad? that's..... bad [11:35:13] indeed, what does traceroute or mtr have to say about it? just for giggles [11:36:11] (03PS1) 10Volans: Icinga: add sync check for MW config on etcd [puppet] - 10https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) [11:36:13] (03PS1) 10Volans: Icinga: add EtcdConfig sync check on MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) [11:36:30] if i'm reading this traceroute correctly it's bouncing between london and new york several times :D [11:36:37] or else it's a reeeeally inconsistent route [11:36:43] (03CR) 10jerkins-bot: [V: 04-1] Icinga: add sync check for MW config on etcd [puppet] - 10https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [11:36:46] lovely [11:37:17] !log kartik@tin Started deploy [cxserver/deploy@300f728]: Update cxserver to b0404d1 [11:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:46] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [11:39:33] godog: https://gist.github.com/brion/bda3be94f6be3998e7b031b8737829a3 <- i'm not at all sure it's correct, it feels like it's returning different routes but it might well be a loop, dunno ;) [11:39:33] !log purge ORES from
scb hosts T168073 T171851 [11:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:47] T168073: Switch ORES to dedicated cluster - https://phabricator.wikimedia.org/T168073 [11:39:47] T171851: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851 [11:40:14] (03PS4) 10Marostegui: install_server: Reimage db1115 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/413342 (https://phabricator.wikimedia.org/T184704) [11:40:37] ah it's sped up now a bit [11:40:44] * brion blames the local wifi [11:40:54] !log kartik@tin Finished deploy [cxserver/deploy@300f728]: Update cxserver to b0404d1 (duration: 03m 37s) [11:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:39] (03PS1) 10Arturo Borrero Gonzalez: toollabs: apt_pinning: bump pinning value for kubernetes packages [puppet] - 10https://gerrit.wikimedia.org/r/413357 (https://phabricator.wikimedia.org/T181647) [11:44:27] (03PS1) 10Marostegui: db-eqiad.php: Repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413358 [11:44:39] (03PS1) 10Alexandros Kosiaris: profile::ores::redis::client_hosts: Remove scb hosts [puppet] - 10https://gerrit.wikimedia.org/r/413359 [11:45:05] (03PS5) 10Marostegui: install_server: Reimage db1115 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/413342 (https://phabricator.wikimedia.org/T184704) [11:45:32] (03CR) 10Jcrespo: [C: 031] install_server: Reimage db1115 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/413342 (https://phabricator.wikimedia.org/T184704) (owner: 10Marostegui) [11:45:34] (03PS2) 10Alexandros Kosiaris: profile::ores::redis::client_hosts: Remove scb hosts [puppet] - 10https://gerrit.wikimedia.org/r/413359 [11:45:38] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] profile::ores::redis::client_hosts: Remove scb hosts [puppet] - 10https://gerrit.wikimedia.org/r/413359 (owner: 10Alexandros Kosiaris) [11:45:40] (03CR) 10Marostegui: [C: 032] 
install_server: Reimage db1115 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/413342 (https://phabricator.wikimedia.org/T184704) (owner: 10Marostegui) [11:45:43] brion: yup the 3/4/5 hops are weird alright, fwiw if you are on a unix system mtr/mytraceroute is also useful [11:45:55] "it's a unix system!" [11:46:04] mtr is great yeah, i've got it via brew on mac i think [11:46:45] https://www.youtube.com/watch?v=dFUlAQZB9Ng for the nostalgic [11:46:46] yeah the route seems to have cleared up, it's stable now [11:46:52] neat [11:47:04] brion: step 6 7 8 9 10 12 and 14 are probably Equal Cost Multipath [11:47:09] aha [11:47:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413358 (owner: 10Marostegui) [11:47:14] (03PS6) 10Marostegui: install_server: Reimage db1115 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/413342 (https://phabricator.wikimedia.org/T184704) [11:47:29] but at step 12 14 there seems to be a loop or something [11:48:13] (03PS2) 10Arturo Borrero Gonzalez: toollabs: apt_pinning: bump pinning value for kubernetes packages [puppet] - 10https://gerrit.wikimedia.org/r/413357 (https://phabricator.wikimedia.org/T181647) [11:48:24] or maybe it's just me reading it wrong [11:48:50] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413358 (owner: 10Marostegui) [11:50:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1089 and slowly repool db1076 (duration: 01m 12s) [11:50:21] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toollabs: apt_pinning: bump pinning value for kubernetes packages [puppet] - 10https://gerrit.wikimedia.org/r/413357 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [11:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:37] (03PS2) 10Volans: Icinga: add sync check for MW config on etcd [puppet] - 
10https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) [11:55:39] (03PS2) 10Volans: Icinga: add EtcdConfig sync check on MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) [11:59:56] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413358 (owner: 10Marostegui) [12:01:06] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team: Use external dsh group to list pooled ORES nodes - https://phabricator.wikimedia.org/T179501#3992511 (10akosiaris) [12:02:42] 10Operations, 10DBA, 10Patch-For-Review: Puppetize tendril web user creation - https://phabricator.wikimedia.org/T148955#3992515 (10jcrespo) 05Open>03Resolved a:03akosiaris I'll consider this done- at least tracked on puppet. Better handling should be a goal on itself, and solved for all services. [12:05:01] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add the helm chart for mathoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/410964 (https://phabricator.wikimedia.org/T184919) (owner: 10Alexandros Kosiaris) [12:06:14] (03PS21) 10Jon Harald Søby: Change namespaces on urwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) (owner: 10Zoranzoki21) [12:15:42] (03PS22) 10Jon Harald Søby: Change namespaces on urwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) (owner: 10Zoranzoki21) [12:22:35] (03CR) 10Pmiazga: [C: 031] Show HTML summaries on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396318 (https://phabricator.wikimedia.org/T182321) (owner: 10EddieGP) [12:24:18] <_joe_> !log live-hacking ProductionServices.php on mwdebug1001 for testing (T185078) [12:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:32] T185078: Test EtcdConfig in different failure scenarios - 
https://phabricator.wikimedia.org/T185078 [12:26:00] (03PS3) 10Pmiazga: Show HTML summaries on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396318 (https://phabricator.wikimedia.org/T182321) (owner: 10EddieGP) [12:29:04] (03CR) 10Phuedx: [C: 031] Show HTML summaries on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396318 (https://phabricator.wikimedia.org/T182321) (owner: 10EddieGP) [12:42:07] <_joe_> !log ended live-hacking on mwdebug1001 (T185078) [12:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:21] T185078: Test EtcdConfig in different failure scenarios - https://phabricator.wikimedia.org/T185078 [12:46:56] (03PS3) 10Volans: Icinga: add sync check for MW config on etcd [puppet] - 10https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) [12:46:58] (03PS3) 10Volans: Icinga: add EtcdConfig sync check on MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) [12:47:30] (03CR) 10jerkins-bot: [V: 04-1] Icinga: add sync check for MW config on etcd [puppet] - 10https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [12:47:44] jenkins give me a break... 
it's not a good day :) [12:48:21] (03CR) 10Jcrespo: [C: 032] wmnet: Change tendril backend to db1115 [dns] - 10https://gerrit.wikimedia.org/r/413317 (https://phabricator.wikimedia.org/T184704) (owner: 10Marostegui) [12:49:14] (03PS4) 10Volans: Icinga: add sync check for MW config on etcd [puppet] - 10https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) [12:49:16] (03PS4) 10Volans: Icinga: add EtcdConfig sync check on MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) [12:52:05] 10Operations, 10Mathoid, 10Prod-Kubernetes, 10Kubernetes, and 3 others: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919#3992589 (10Physikerwelt) Is there a link to see the chart? [12:54:22] 10Operations, 10MediaWiki-Configuration, 10User-Joe, 10discovery-system: Test EtcdConfig in different failure scenarios - https://phabricator.wikimedia.org/T185078#3992590 (10Joe) First test to fail: If I declare an invalid hostname as the etcd server in a config change, an exception is thrown and not cau... [12:58:16] (03CR) 10Volans: [C: 04-2] "[do not merge] pending MediaWiki changes to expose the etcd last index." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [13:00:44] 10Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#3992596 (10aborrero) [13:03:59] (03PS1) 10Elukey: [WIP] eventlogging: add systemd support [puppet] - 10https://gerrit.wikimedia.org/r/413362 [13:05:10] (03CR) 10Volans: [C: 04-2] "[do not merge] pending I8b790669901c08fcd98337ce4c31044e1f11ffe9" [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [13:10:41] (03PS1) 10Jcrespo: tendril:Migrate target host of mariadb maintenance to db1115 [puppet] - 10https://gerrit.wikimedia.org/r/413363 (https://phabricator.wikimedia.org/T184704) [13:11:23] (03CR) 10Jcrespo: [C: 032] tendril:Migrate target host of mariadb maintenance to db1115 [puppet] - 10https://gerrit.wikimedia.org/r/413363 (https://phabricator.wikimedia.org/T184704) (owner: 10Jcrespo) [13:11:38] (03PS2) 10Elukey: [WIP] eventlogging: add systemd support [puppet] - 10https://gerrit.wikimedia.org/r/413362 [13:21:46] !log upgrade pybal on lvs1003 to 1.14.4 [13:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:32] PROBLEM - PyBal connections to etcd on lvs1003 is CRITICAL: CRITICAL: 38 connections established with conf1001.eqiad.wmnet:2379 (min=41) [13:27:00] looking ^ [13:30:42] !log rebooting kubernetes1001 [13:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:35] 10Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#3992672 (10faidon) > However, iptables is being replaced by nftables. It seems to me like nftables is still not very widely used (as also evidenced by the upstreams you mentioned not having adopted it yet) a... 
[13:40:40] (CR) Filippo Giunchedi: Icinga: add sync check for MW config on etcd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) (owner: Volans) [13:42:14] (PS3) Elukey: [WIP] eventlogging: add systemd support [puppet] - https://gerrit.wikimedia.org/r/413362 [13:44:49] Operations, Domains, Traffic, WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3992675 (Dzahn) @Volker_E Yes, that's right. Similar to to bmansurov's. You would ask for it to clone from Github while i can te... [13:45:32] RECOVERY - PyBal connections to etcd on lvs1003 is OK: OK: 41 connections established with conf1001.eqiad.wmnet:2379 (min=41) [13:47:05] (PS23) Zoranzoki21: Change namespaces on urwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) [13:52:07] Hi [13:52:22] Please help me with: https://gerrit.wikimedia.org/r/#/c/407901/ [13:53:35] (PS4) Zfilipin: Load 3D extension on other wikis, for display only [mediawiki-config] - https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) (owner: Matthias Mullie) [13:54:22] (CR) Zfilipin: "@Matthias Mullie: EU SWAT is in few minutes, you should remove -1 if you want this deployed" [mediawiki-config] - https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) (owner: Matthias Mullie) [13:55:33] (PS4) Elukey: [WIP] eventlogging: add systemd support [puppet] - https://gerrit.wikimedia.org/r/413362 [13:56:00] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/10086/" [puppet] - https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) (owner: Filippo Giunchedi) [13:58:30] there we go :_ [13:59:16] (PS5) Elukey: [WIP] eventlogging: add systemd support [puppet] - 
https://gerrit.wikimedia.org/r/413362 [14:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180222T1400). [14:00:04] matthiasmullie and raynor: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:22] o/ - Ready [14:00:29] I can SWAT today [14:00:43] hello zeljkof, it's a small world ;) [14:00:58] :) [14:01:07] here [14:01:21] matthiasmullie and raynor: I always forget if I have already asked, do you want to deploy your change? (if you can) [14:01:42] I don't have rights to do that [14:01:55] (CR) Elukey: "Currently no-op for eventlog1001 - https://puppet-compiler.wmflabs.org/compiler02/10088/eventlog1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/413362 (owner: Elukey) [14:02:10] raynor: you should fix that problem ;P [14:03:24] matthiasmullie: do you want to deploy your change? or should I? [14:03:33] it's not a problem, the less permissions I have the better sleep I get [14:03:53] raynor: developers should deploy their code, in general, I think :) [14:03:58] (CR) Muehlenhoff: [WIP] eventlogging: add systemd support (2 comments) [puppet] - https://gerrit.wikimedia.org/r/413362 (owner: Elukey) [14:04:25] agree [14:04:32] (CR) Zfilipin: [C: 1] Load 3D extension on other wikis, for display only [mediawiki-config] - https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) (owner: Matthias Mullie) [14:04:57] looks like matthiasmullie is not around, I'll start with your patch raynor then [14:05:18] (CR) Elukey: "thanks!" 
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/413362 (owner: Elukey) [14:05:50] cool, zeljkof we need ~15 mins to fully test it [14:06:02] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/396318 (https://phabricator.wikimedia.org/T182321) (owner: EddieGP) [14:06:37] (PS6) Elukey: [WIP] eventlogging: add systemd support [puppet] - https://gerrit.wikimedia.org/r/413362 [14:06:58] (CR) Elukey: [C: 1] cassandra: use prometheus-jmx-exporter Debian package [puppet] - https://gerrit.wikimedia.org/r/402069 (https://phabricator.wikimedia.org/T181728) (owner: Filippo Giunchedi) [14:07:38] (CR) Volans: [C: 1] "LGTM. I still would prefer to install the script without .py, but is an old discussion and basically personal preference." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) (owner: Filippo Giunchedi) [14:07:40] (Merged) jenkins-bot: Show HTML summaries on cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396318 (https://phabricator.wikimedia.org/T182321) (owner: EddieGP) [14:07:48] (CR) jenkins-bot: Show HTML summaries on cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396318 (https://phabricator.wikimedia.org/T182321) (owner: EddieGP) [14:07:51] I'm here - I can deploy myself [14:07:56] let me know when you're ready [14:08:06] (was having connectivity issues - on different network now) [14:08:21] matthiasmullie: sorry, already started with the other commit, will ping you in 15 minutes or so [14:09:43] no worries, take your time :) [14:09:53] raynor: your patch is at mwdebug1002, please test and let me know if I can deploy [14:10:10] ok, I'm testing that with Olga, we will try to test it as fast as possible [14:13:16] (CR) Filippo Giunchedi: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle) [14:14:02] zeljkof - so 
far it looks good, we need bit more time to verify some articles. Looks promising [14:14:19] * zeljkof thumbs up [14:19:32] (PS1) Jcrespo: tendril: MariaDB fixes and tunings [puppet] - https://gerrit.wikimedia.org/r/413368 (https://phabricator.wikimedia.org/T184704) [14:21:06] (PS1) Elukey: role::cache::canary|misc: remove testing vk instance [puppet] - https://gerrit.wikimedia.org/r/413370 (https://phabricator.wikimedia.org/T185136) [14:22:34] (CR) Zfilipin: Change namespaces on urwiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/407901 (https://phabricator.wikimedia.org/T186393) (owner: Zoranzoki21) [14:22:57] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/10089/" [puppet] - https://gerrit.wikimedia.org/r/402069 (https://phabricator.wikimedia.org/T181728) (owner: Filippo Giunchedi) [14:23:26] (CR) Ottomata: [C: 1] role::cache::canary|misc: remove testing vk instance [puppet] - https://gerrit.wikimedia.org/r/413370 (https://phabricator.wikimedia.org/T185136) (owner: Elukey) [14:25:04] raynor: just checking on progress, do you need more time to test? [14:25:26] zeljkof: ok, we're good [14:25:31] please deploy that to production [14:25:35] raynor: deploying [14:25:50] Operations, Mathoid, Prod-Kubernetes, Kubernetes, and 3 others: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919#3992735 (akosiaris) Yes. We don't have a proper helm repo yet (TBD) but the chart itself is here https://gerrit.wikimedia.org/g/operations/deplo... 
[14:26:39] (PS5) Matthias Mullie: Load 3D extension on other wikis, for display only [mediawiki-config] - https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) [14:26:53] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:396318|Show HTML summaries on cswiki (T182321)]] (duration: 01m 13s) [14:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:09] T182321: Show HTML summaries on cswiki - https://phabricator.wikimedia.org/T182321 [14:27:29] raynor: deployed; please check and thanks for deploying with #releng ;) [14:27:36] matthiasmullie: swat is yours! :) [14:27:37] (PS3) BBlack: gzip .stl files on transfer (application/sla) [puppet] - https://gerrit.wikimedia.org/r/413236 (https://phabricator.wikimedia.org/T187930) (owner: Brion VIBBER) [14:28:06] (CR) BBlack: [C: 2] gzip .stl files on transfer (application/sla) [puppet] - https://gerrit.wikimedia.org/r/413236 (https://phabricator.wikimedia.org/T187930) (owner: Brion VIBBER) [14:28:21] okay thanks [14:28:36] matthiasmullie: please close SWAT window when you are done [14:29:25] don't let any more SWAT in! [14:29:45] godog: something going on? [14:29:56] zeljkof: wait - what does that mean? 
:) [14:29:56] zeljkof: sorry, no I was kidding [14:29:59] (CR) Matthias Mullie: [C: 2] Load 3D extension on other wikis, for display only [mediawiki-config] - https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) (owner: Matthias Mullie) [14:30:01] (CR) Ottomata: "Looks good https://puppet-compiler.wmflabs.org/compiler02/10090/cp1045.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/413243 (https://phabricator.wikimedia.org/T185136) (owner: Ottomata) [14:30:08] (PS2) Ottomata: Produce webrequest_misc logs to Kafka jumbo instead of Kafka analytics [puppet] - https://gerrit.wikimedia.org/r/413243 (https://phabricator.wikimedia.org/T185136) [14:30:09] godog: uh :D [14:30:39] godog: please remember, "serious stuff" ;P [14:31:04] haha indeed, I should have put a smiley there #fail [14:31:23] #emoticonfail [14:31:28] we should have a webcast of a SWAT deploy some day. so that it can be live swat'ed [14:31:49] zeljkof: what does "close the window" mean? anything in particular I need to do? [14:31:50] thedj: +1 [14:31:56] heh [14:32:00] it would mostly be me in a dark room looking at a terminal screen :) [14:32:04] (Merged) jenkins-bot: Load 3D extension on other wikis, for display only [mediawiki-config] - https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) (owner: Matthias Mullie) [14:32:10] matthiasmullie: !log EU SWAT finished :) [14:32:16] "i'm deploying the patch now. wait what's going on ? I didin't do anything" :) [14:32:18] that is all [14:32:18] (CR) jenkins-bot: Load 3D extension on other wikis, for display only [mediawiki-config] - https://gerrit.wikimedia.org/r/410433 (https://phabricator.wikimedia.org/T187261) (owner: Matthias Mullie) [14:32:35] zeljkof: thanks for deployment. Everything went super smooth \o/ [14:32:45] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2171.codfw.wmnet [14:32:48] raynor: glad I could help! 
[14:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:04] oh alright :p [14:33:52] moritzm: ^^^ as before :) [14:35:53] bblack: confirmed it's working! \o/ [14:37:03] man i love transparent compression [14:37:14] makes all these horrible file formats more palatable on the network ;) [14:37:52] STL is mostly just a bunch of 32-bit floats in raw binary heheh [14:38:07] * mark blames brion for our decreasing network traffic [14:38:26] if you need more budget i can make the files bigger [14:38:27] please make up for it with more multimedia/video [14:38:29] ssst. before you know it he wants to deploy 4K video [14:38:32] >:D [14:38:39] please do [14:38:52] !log mlitn@tin Synchronized wmf-config/InitialiseSettings.php: Enable 3D file display (duration: 01m 13s) [14:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:13] wee it works ! 3D embedding [14:40:04] !log mlitn@tin scap failed: average error rate on 3/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) [14:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:15] !log beginning migration of webrequest_misc from Kafka analytics to jumbo: T185136 [14:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:27] T185136: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136 [14:41:33] hmm, we should add some metadata to the patent template.... [14:42:03] (CR) Ottomata: [C: 2] Produce webrequest_misc logs to Kafka jumbo instead of Kafka analytics [puppet] - https://gerrit.wikimedia.org/r/413243 (https://phabricator.wikimedia.org/T185136) (owner: Ottomata) [14:45:08] > Fatal error: Uncaught exception 'Exception' with message '3d requires MultimediaViewer to be installed.' 
[14:45:38] matthiasmullie: zeljkof ^ FYI [14:45:44] is the error on the canary servers [14:46:02] working on it! [14:46:06] we have servers without MMV ? [14:49:03] thcipriani|afk: uh oh, but looks like matthiasmullie is on it [14:49:57] cool, just saw the alert and jumped :) [14:50:28] !log mlitn@tin Synchronized php-1.31.0-wmf.21/extensions/3D/extension.json: Remove MMV dependency for 3D (duration: 01m 12s) [14:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:42] !log mlitn@tin Synchronized wmf-config/CommonSettings.php: Enable 3D file display (duration: 01m 12s) [14:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:57] yes [14:55:06] (PS1) Jcrespo: mariadb: Fix and standarize firewall holes to all cloud-related mariadbs [puppet] - https://gerrit.wikimedia.org/r/413375 (https://phabricator.wikimedia.org/T184704) [14:55:07] i mean how cool is this: https://commons.wikimedia.org/wiki/File:ISS_2016.stl#/media/File:ISS_2016.stl [14:55:21] ^"Do you really want to submit the above commits?" yes/no [14:55:47] (CR) jerkins-bot: [V: -1] mariadb: Fix and standarize firewall holes to all cloud-related mariadbs [puppet] - https://gerrit.wikimedia.org/r/413375 (https://phabricator.wikimedia.org/T184704) (owner: Jcrespo) [14:55:56] jynus: lol [14:57:22] I guess I should have said no [14:57:57] (PS5) Volans: Icinga: add sync check for MW config on etcd [puppet] - https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) [14:57:59] (PS5) Volans: Icinga: add EtcdConfig sync check on MW hosts [puppet] - https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) [14:58:01] (CR) Volans: [C: -2] "Thanks for the review, changes inline. 
(still -2)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) (owner: Volans) [14:58:59] !log EU SWAT finished [14:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:13] (CR) Ottomata: "Wonder if we could also have a parent 'eventlogging-processors?' that we could use to stop/start just all processors?" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/413362 (owner: Elukey) [15:05:27] (PS2) Jcrespo: tendril: MariaDB fixes and tunings [puppet] - https://gerrit.wikimedia.org/r/413368 (https://phabricator.wikimedia.org/T184704) [15:05:29] (PS2) Jcrespo: mariadb: Fix and standarize firewall holes to all cloud-related mariadbs [puppet] - https://gerrit.wikimedia.org/r/413375 (https://phabricator.wikimedia.org/T184704) [15:06:10] (CR) jerkins-bot: [V: -1] mariadb: Fix and standarize firewall holes to all cloud-related mariadbs [puppet] - https://gerrit.wikimedia.org/r/413375 (https://phabricator.wikimedia.org/T184704) (owner: Jcrespo) [15:08:07] (PS4) Andrew Bogott: labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506) [15:09:19] (CR) Jcrespo: [C: 2] tendril: MariaDB fixes and tunings [puppet] - https://gerrit.wikimedia.org/r/413368 (https://phabricator.wikimedia.org/T184704) (owner: Jcrespo) [15:10:12] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:10:13] PROBLEM - Varnish frontend child restarted on cp5001 is CRITICAL: CRITICAL - varnish-frontend-check-child-start is 8 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp5001&var-datasource=eqsin+prometheus/ops [15:11:00] that's me ^ [15:11:48] (PS1) Ottomata: Fix logrotate file for ops webrequest kafkatee [puppet] - https://gerrit.wikimedia.org/r/413382 (https://phabricator.wikimedia.org/T187890) [15:12:48] (PS3) Jcrespo: mariadb: Fix and standarize firewall holes to all cloud-related mariadbs [puppet] - https://gerrit.wikimedia.org/r/413375 (https://phabricator.wikimedia.org/T184704) [15:14:13] RECOVERY - Varnish frontend child restarted on cp5001 is OK: OK - varnish-frontend-check-child-start is 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp5001&var-datasource=eqsin+prometheus/ops [15:16:11] win 65 [15:18:44] (CR) Elukey: "pcc happy, list of things to remove:" [puppet] - https://gerrit.wikimedia.org/r/413370 (https://phabricator.wikimedia.org/T185136) (owner: Elukey) [15:18:51] (PS2) Elukey: role::cache::canary|misc: remove testing vk instance [puppet] - https://gerrit.wikimedia.org/r/413370 (https://phabricator.wikimedia.org/T185136) [15:19:04] (PS1) Ottomata: Rotate kafkatee instance stats files [puppet/kafkatee] - https://gerrit.wikimedia.org/r/413383 (https://phabricator.wikimedia.org/T187890) [15:19:48] (PS2) Ottomata: Rotate kafkatee instance stats files [puppet/kafkatee] - https://gerrit.wikimedia.org/r/413383 (https://phabricator.wikimedia.org/T187890) [15:20:39] (CR) Ottomata: [C: 2] Rotate kafkatee instance stats files [puppet/kafkatee] - https://gerrit.wikimedia.org/r/413383 (https://phabricator.wikimedia.org/T187890) (owner: Ottomata) [15:21:09] (CR) Ema: [C: 1] role::cache::canary|misc: remove testing vk instance 
[puppet] - https://gerrit.wikimedia.org/r/413370 (https://phabricator.wikimedia.org/T185136) (owner: Elukey) [15:22:14] (CR) Elukey: [C: 2] role::cache::canary|misc: remove testing vk instance [puppet] - https://gerrit.wikimedia.org/r/413370 (https://phabricator.wikimedia.org/T185136) (owner: Elukey) [15:24:15] (PS2) Ottomata: Fix logrotate files for ops webrequest kafkatee [puppet] - https://gerrit.wikimedia.org/r/413382 (https://phabricator.wikimedia.org/T187890) [15:24:40] !log manually removing from cp1008 and cache::misc old files related to the varnishkafka jumbo testing instance (after https://gerrit.wikimedia.org/r/413370) [15:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:14] (PS3) Ottomata: Fix logrotate files for ops webrequest kafkatee [puppet] - https://gerrit.wikimedia.org/r/413382 (https://phabricator.wikimedia.org/T187890) [15:30:35] (PS5) Andrew Bogott: labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506) [15:30:41] (PS6) Andrew Bogott: labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506) [15:31:53] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 [15:32:43] (CR) Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10096/" [puppet] - https://gerrit.wikimedia.org/r/413382 (https://phabricator.wikimedia.org/T187890) (owner: Ottomata) [15:32:48] (CR) Ottomata: [C: 2] Fix logrotate files for ops webrequest kafkatee [puppet] - https://gerrit.wikimedia.org/r/413382 (https://phabricator.wikimedia.org/T187890) (owner: Ottomata) [15:34:48] (PS1) Muehlenhoff: Add library hints for GCC (which ships a ton of low level libs) [puppet] - https://gerrit.wikimedia.org/r/413389 [15:36:10] (PS2) Muehlenhoff: Add library hints for GCC 
(which ships a ton of low level libs) [puppet] - https://gerrit.wikimedia.org/r/413389 [15:37:04] Operations, Domains, Traffic, WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3992883 (Volker_E) @Dzahn Just to be explicit about our latest structure. We plan to have an index page in design.wikimedia.org... [15:37:39] (CR) Muehlenhoff: [C: 2] Add library hints for GCC (which ships a ton of low level libs) [puppet] - https://gerrit.wikimedia.org/r/413389 (owner: Muehlenhoff) [15:38:48] (PS7) Andrew Bogott: labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506) [15:39:32] (CR) jerkins-bot: [V: -1] labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506) (owner: Andrew Bogott) [15:40:13] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:41:14] Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3992904 (Papaul) @Marostegui Disk in slot 5 is blinking. [15:43:08] (PS8) Andrew Bogott: labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506) [15:43:54] Operations, Domains, Traffic, WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (bmansurov) Unsubscribing, just to keep the noise down. Ping me if you need anything from me. 
[15:46:36] (CR) Andrew Bogott: [C: 2] labweb: install nutcracker [puppet] - https://gerrit.wikimedia.org/r/413275 (https://phabricator.wikimedia.org/T187506) (owner: Andrew Bogott) [15:47:40] (PS3) Ppchelko: Disable Redis JobQueue for refreshLinks. [mediawiki-config] - https://gerrit.wikimedia.org/r/408569 (https://phabricator.wikimedia.org/T185052) [15:50:05] Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3992982 (jcrespo) db2030 or db2037? [15:58:58] Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#3992596 (jcrespo) I am agnostic about this, do not know enough differences- arguably, I think we should better puppetize the current rules- but I have never found problems with the backends to understand t... [15:59:24] Operations, HHVM, MW-1.31-release-notes (WMF-deploy-2018-02-13 (1.31.0-wmf.21)), Performance-Team (Radar): HHVM hangs on the API cluster - https://phabricator.wikimedia.org/T184048#3993001 (elukey) [16:02:04] Operations, Ops-Access-Requests, Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3993005 (RobH) Open>stalled This has now sat without feedback since Monday. Please... 
[16:02:43] (PS1) Jcrespo: tendril: Enable mysql binlog with format ROW [puppet] - https://gerrit.wikimedia.org/r/413393 (https://phabricator.wikimedia.org/T184704) [16:04:01] (CR) Jcrespo: [C: 2] tendril: Enable mysql binlog with format ROW [puppet] - https://gerrit.wikimedia.org/r/413393 (https://phabricator.wikimedia.org/T184704) (owner: Jcrespo) [16:04:12] (PS1) Andrew Bogott: labweb nutcracker: re-use profile::mediawiki::nutcracker [puppet] - https://gerrit.wikimedia.org/r/413394 (https://phabricator.wikimedia.org/T187506) [16:07:05] Operations, ops-codfw, DBA, Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3993031 (Papaul) No blinking disk on db2037 all looks good. And ILO is working on my end [16:07:47] (PS7) Elukey: [WIP] eventlogging: add systemd support [puppet] - https://gerrit.wikimedia.org/r/413362 [16:10:33] Operations, Mathoid, Prod-Kubernetes, Kubernetes, and 3 others: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919#3993065 (Physikerwelt) Ok. I was just asking because I will be giving a talk on Math rendering in Wikipedia on Mar. 8th and it would be nice to... 
[16:11:44] (PS1) Andrew Bogott: labweb nutcracker: further attempt to pass in memcached_pools correctly [puppet] - https://gerrit.wikimedia.org/r/413396 (https://phabricator.wikimedia.org/T187506) [16:12:38] I am going to put down tendril again [16:12:52] and dbtree [16:12:53] (CR) Bstorm: [C: 2] tools-static: Change to reverse proxy of cdnjs [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) (owner: Bstorm) [16:13:04] (PS6) Bstorm: tools-static: Change to reverse proxy of cdnjs [puppet] - https://gerrit.wikimedia.org/r/413197 (https://phabricator.wikimedia.org/T182604) [16:13:10] !log tendril and dbtree database currently under maintanance [16:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:15] (PS2) Andrew Bogott: labweb nutcracker: further attempt to pass in memcached_pools correctly [puppet] - https://gerrit.wikimedia.org/r/413396 (https://phabricator.wikimedia.org/T187506) [16:17:12] (CR) Mobrovac: [C: 2] Disable Redis JobQueue for refreshLinks. [mediawiki-config] - https://gerrit.wikimedia.org/r/408569 (https://phabricator.wikimedia.org/T185052) (owner: Ppchelko) [16:17:14] (PS1) Muehlenhoff: Switch debdeploy clients to Python 3 (WIP) [debs/debdeploy] - https://gerrit.wikimedia.org/r/413397 [16:18:04] (PS2) Cmjohnson: Adding mgmt and production dns [dns] - https://gerrit.wikimedia.org/r/413230 (https://phabricator.wikimedia.org/T186073) [16:19:13] (Merged) jenkins-bot: Disable Redis JobQueue for refreshLinks. [mediawiki-config] - https://gerrit.wikimedia.org/r/408569 (https://phabricator.wikimedia.org/T185052) (owner: Ppchelko) [16:19:24] (CR) jenkins-bot: Disable Redis JobQueue for refreshLinks. 
[mediawiki-config] - https://gerrit.wikimedia.org/r/408569 (https://phabricator.wikimedia.org/T185052) (owner: Ppchelko) [16:20:27] (PS3) Cmjohnson: Adding mgmt and production dns [dns] - https://gerrit.wikimedia.org/r/413230 (https://phabricator.wikimedia.org/T186073) [16:21:16] (CR) Cmjohnson: [C: 2] Adding mgmt and production dns [dns] - https://gerrit.wikimedia.org/r/413230 (https://phabricator.wikimedia.org/T186073) (owner: Cmjohnson) [16:21:36] (PS1) Jcrespo: Revert "mariadb: Depool db2073 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/413398 [16:22:05] !log mobrovac@tin scap failed: average error rate on 8/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) [16:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:06] !log ppchelko@tin Started deploy [cpjobqueue/deploy@ab3d002]: Enable refreshLinks for group0 wikis T185052 [16:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:20] T185052: Migrate RefreshLinks job to kafka - https://phabricator.wikimedia.org/T185052 [16:23:21] known, fixing ^ [16:23:30] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db2073 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/413398 (owner: Jcrespo) [16:23:42] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@ab3d002]: Enable refreshLinks for group0 wikis T185052 (duration: 00m 36s) [16:23:52] wow are you guys migrating refreshlinks?? 
[16:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:47] elukey: only for test wikis [16:24:59] so not even 5% of the overall traffic of refreshlinks [16:25:11] small steps :) [16:25:34] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Use EventBus for refreshLinks in test wikis, file 1/2 - T185052 (duration: 01m 12s) [16:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:03] (PS3) Andrew Bogott: labweb nutcracker: further attempt to pass in memcached_pools correctly [puppet] - https://gerrit.wikimedia.org/r/413396 (https://phabricator.wikimedia.org/T187506) [16:26:05] PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:26:11] (CR) jerkins-bot: [V: -1] labweb nutcracker: further attempt to pass in memcached_pools correctly [puppet] - https://gerrit.wikimedia.org/r/413396 (https://phabricator.wikimedia.org/T187506) (owner: Andrew Bogott) [16:26:50] did someone reboot db2037? 
[16:26:54] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Use EventBus for refreshLinks in test wikis, file 2/2 - T185052 (duration: 01m 12s) [16:26:58] (CR) jenkins-bot: Revert "mariadb: Depool db2073 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/413398 (owner: Jcrespo) [16:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:24] PROBLEM - Host db2037.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:27:32] marostegui, papaul^ [16:29:28] managament is also unaccesible [16:30:11] yeah [16:30:31] papaul is doing the power drain as far as I know [16:32:38] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413400 [16:33:54] (PS2) Andrew Bogott: labweb nutcracker: re-use profile::mediawiki::nutcracker [puppet] - https://gerrit.wikimedia.org/r/413394 (https://phabricator.wikimedia.org/T187506) [16:37:30] (PS3) Andrew Bogott: labweb nutcracker: re-use profile::mediawiki::nutcracker [puppet] - https://gerrit.wikimedia.org/r/413394 (https://phabricator.wikimedia.org/T187506) [16:37:54] RECOVERY - Host db2037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.61 ms [16:37:58] (PS1) Jcrespo: tendril: Fix for server to start correcty [puppet] - https://gerrit.wikimedia.org/r/413401 (https://phabricator.wikimedia.org/T184704) [16:38:20] (CR) Andrew Bogott: [C: 2] labweb nutcracker: re-use profile::mediawiki::nutcracker [puppet] - https://gerrit.wikimedia.org/r/413394 (https://phabricator.wikimedia.org/T187506) (owner: Andrew Bogott) [16:39:47] (PS2) Jcrespo: tendril: Fix for server to start correcty [puppet] - https://gerrit.wikimedia.org/r/413401 (https://phabricator.wikimedia.org/T184704) [16:40:47] (CR) Jcrespo: [C: 2] tendril: Fix for server to start correcty [puppet] - https://gerrit.wikimedia.org/r/413401 (https://phabricator.wikimedia.org/T184704) (owner: Jcrespo) [16:40:50] 
Operations, ops-codfw, DBA: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#3993152 (Papaul) a:Papaul>Marostegui Disk replacement complete. [16:45:20] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413400 (owner: Marostegui) [16:47:08] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413400 (owner: Marostegui) [16:47:18] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/413400 (owner: Marostegui) [16:47:44] (CR) Mobrovac: "Hm, so looking at the chart and the current Mathoid config, we will have to find a way to include more complex config variables (like YAML" [deployment-charts] - https://gerrit.wikimedia.org/r/410964 (https://phabricator.wikimedia.org/T184919) (owner: Alexandros Kosiaris) [16:49:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1076 (duration: 01m 12s) [16:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:43] (PS8) Elukey: [WIP] eventlogging: add systemd support [puppet] - https://gerrit.wikimedia.org/r/413362 [16:51:54] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [16:52:34] Operations, ops-codfw, DBA: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#3993182 (Marostegui) Thanks a lot! ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 9% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding) physicaldriv... [16:53:44] Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#3993185 (aborrero) >>! 
In T187994#3992995, @jcrespo wrote: > I am agnostic about this, do not know enough differences- arguably, I think we should better puppetize the current rules- but I have never found... [16:55:24] (03PS1) 10Elukey: role::aqs: enable Cassandra JMX exporter [puppet] - 10https://gerrit.wikimedia.org/r/413405 (https://phabricator.wikimedia.org/T184795) [16:55:41] no_justification: any chance of gerrit editor supporting tabs (not spaces)? [16:55:55] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [16:56:21] doesn't it already? [16:58:32] !log redirecting ns2 traffic to radon [16:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:18] (03CR) 10Elukey: "Is it that easy???" [puppet] - 10https://gerrit.wikimedia.org/r/413405 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [16:59:23] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3993212 (10Papaul) a:05Papaul>03Marostegui 1- Power drain 2- Update all firmware on the system. [17:00:04] godog, moritzm, and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180222T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:02:13] RECOVERY - haproxy failover on dbproxy1005 is OK: OK check_failover servers up 2 down 0 [17:02:22] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission old and unused/spare servers in codfw - https://phabricator.wikimedia.org/T187474#3993221 (10Papaul) [17:02:26] !log reboot eeden with new kernel 4.9.0-0.bpo.6 [17:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:14] AaronSchulz hi, yes. [17:04:24] AaronSchulz it should be in the edit preference. 
[17:04:54] XioNoX: eeden back online already [17:05:33] that's fast [17:05:37] yep! [17:05:42] rolling back router changes [17:05:59] !log rolling back "redirecting ns2 traffic to radon" [17:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:30] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3993231 (10Marostegui) 1- The ipmi is still not working :-( 2- Thanks! Also thanks for checking the disks, the system boot up finely, wh... [17:08:07] (03CR) 10Marostegui: [C: 031] "db2037 looks healthy after the check on site (apart from IPMI, which shouldn't be a blocker)" [puppet] - 10https://gerrit.wikimedia.org/r/412964 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:09:03] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3993252 (10Marostegui) [17:09:08] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3993251 (10Marostegui) 05Open>03Resolved [17:10:17] (03PS1) 10EBernhardson: Resize the Cirrus LTR model cache [puppet] - 10https://gerrit.wikimedia.org/r/413407 (https://phabricator.wikimedia.org/T188015) [17:10:52] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4030 is CRITICAL: connect to address 10.128.0.130 and port 3128: Connection refused [17:10:56] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3317112 (10Deskana) [17:11:43] 10Operations, 10ops-codfw, 10monitoring: db2037 IPMI not working - https://phabricator.wikimedia.org/T188016#3993295 (10Marostegui) p:05Triage>03Normal [17:11:52] RECOVERY - Varnish HTTP text-backend - port 3128 
on cp4030 is OK: HTTP OK: HTTP/1.1 200 OK - 219 bytes in 0.157 second response time [17:11:58] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030) - https://phabricator.wikimedia.org/T187722#3983430 (10Marostegui) Tracking task for the IPMI issue: T188016 [17:12:39] marostegui: https://en.wikipedia.org/wiki/C_Sharp_syntax gives database error [17:12:45] other enwiki pages seems to work fine [17:12:54] you all aware? [17:13:17] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413412 [17:13:33] (03PS1) 10Andrew Bogott: labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 (https://phabricator.wikimedia.org/T187506) [17:14:04] (03CR) 10jerkins-bot: [V: 04-1] labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [17:14:10] I get a Timeout waiting for the lock [17:14:13] Hauskatze: ^ [17:14:15] me too [17:14:22] Sorry, the servers are overloaded at the moment. [17:14:22] Too many users are trying to view this page. Please wait a while before you try to access this page again. 
[17:14:22] Timeout waiting for the lock [17:14:33] an user just reported this on -tech [17:14:40] it's weird that page only though [17:14:58] Hauskatze: I don't see any relevant errors on our logtash [17:15:10] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3993367 (10Jdforrester-WMF) [17:15:12] indeed, I might be misremembering that might be poolcounter too [17:15:35] (03PS2) 10Andrew Bogott: labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 (https://phabricator.wikimedia.org/T187506) [17:15:45] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413412 (owner: 10Marostegui) [17:16:13] (03CR) 10jerkins-bot: [V: 04-1] labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [17:16:14] worth opening a Task? [17:16:25] Hauskatze: yes please, I'm taking a look in the meantime [17:16:40] godog: okay, UBN or normal? [17:17:16] Hauskatze: normal until we know more [17:17:20] i now get [17:17:21] Request from 2a00:23c4:ad0a:7d01:3027:e860:dfb:512e via cp1052 cp1052, Varnish XID 882915698 [17:17:21] Error: 503, Backend fetch failed at Thu, 22 Feb 2018 17:16:17 GMT [17:17:22] ok doing [17:17:41] (03CR) 10Gehel: "LGTM (minor comment inline). 
Puppet compiler agrees: https://puppet-compiler.wmflabs.org/compiler02/10110/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/413407 (https://phabricator.wikimedia.org/T188015) (owner: 10EBernhardson) [17:17:54] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3993403 (10Jdforrester-WMF) [17:17:56] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413412 (owner: 10Marostegui) [17:18:03] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413412 (owner: 10Marostegui) [17:18:29] `Pool key 'enwiki:pcache:idhash:13646669-0!canonical:revid:814037984' (ArticleView): Timeout waiting for the lock` in logstash, helpful? [17:18:37] (03PS1) 10Marostegui: install_serer: Do not format any db in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/413415 (https://phabricator.wikimedia.org/T184704) [17:19:06] (03PS2) 10Marostegui: install_server: Do not format any db in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/413415 (https://phabricator.wikimedia.org/T184704) [17:19:07] https://phabricator.wikimedia.org/T188019 [17:19:12] feel free to add there [17:19:19] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Zayo outage [17:19:19] ACKNOWLEDGEMENT - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Zayo outage [17:20:31] 10Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#3993437 (10aborrero) [17:20:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1076 (duration: 01m 12s) [17:20:54] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: 
Decommission old and unused/spare servers in codfw - https://phabricator.wikimedia.org/T187474#3993442 (10Papaul) [17:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:17] yeah afaics it is that revid causing errors, but poolcounter timeouts levels are about normal [17:22:11] (03CR) 10EBernhardson: Resize the Cirrus LTR model cache (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/413407 (https://phabricator.wikimedia.org/T188015) (owner: 10EBernhardson) [17:22:15] (03PS2) 10EBernhardson: Resize the Cirrus LTR model cache [puppet] - 10https://gerrit.wikimedia.org/r/413407 (https://phabricator.wikimedia.org/T188015) [17:23:15] !log installed linux-perf-4.9 on phab1001 to experiment with perf tracing [17:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:32] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 [17:24:32] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 [17:25:13] Hauskatze: yeah doesn't seem to be a widespread issue afaics, I have to go tho [17:25:28] (03PS1) 10Jcrespo: tendril: Revert configuration to tendril's original one [puppet] - 10https://gerrit.wikimedia.org/r/413417 (https://phabricator.wikimedia.org/T184704) [17:25:51] ok godog, take care; is there anyone else taking care of the issue? 
[17:26:16] (03CR) 10Jcrespo: [C: 032] tendril: Revert configuration to tendril's original one [puppet] - 10https://gerrit.wikimedia.org/r/413417 (https://phabricator.wikimedia.org/T184704) (owner: 10Jcrespo) [17:26:21] (03PS2) 10Jcrespo: tendril: Revert configuration to tendril's original one [puppet] - 10https://gerrit.wikimedia.org/r/413417 (https://phabricator.wikimedia.org/T184704) [17:26:27] (03CR) 10Jcrespo: [V: 032 C: 032] tendril: Revert configuration to tendril's original one [puppet] - 10https://gerrit.wikimedia.org/r/413417 (https://phabricator.wikimedia.org/T184704) (owner: 10Jcrespo) [17:26:32] Hauskatze: don't know for sure [17:26:33] (03PS3) 10Andrew Bogott: labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 (https://phabricator.wikimedia.org/T187506) [17:26:40] kk [17:26:43] :) [17:27:10] (03CR) 10jerkins-bot: [V: 04-1] labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [17:27:21] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frbast1001 - https://phabricator.wikimedia.org/T187363#3993475 (10Cmjohnson) [17:28:08] TheresNoTime: still got logstash opened? [17:28:14] yah [17:28:47] I'm getting an error code at T188014 -- if you could check it for me? 
[17:28:48] T188014: MassMessage doesn't work on dty.wikipedia - https://phabricator.wikimedia.org/T188014 [17:28:50] <_joe_> elukey: perf on phab will not do much [17:29:12] <_joe_> as php has no jit and the calls will be all to zend_execute IIRC [17:29:30] Hauskatze: sure thing, looking now [17:29:35] _joe_ yep yep it was just to experiment, I am not hoping much :) [17:29:42] we already have good stack traces [17:30:01] thanky [17:30:31] (03PS4) 10Andrew Bogott: labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 (https://phabricator.wikimedia.org/T187506) [17:31:00] Hauskatze: want the trace on the phab, or PM? [17:31:41] TheresNoTime: on the Phab task if possible [17:32:50] done ^ [17:34:58] TheresNoTime: thanks! I think that this is as I expected, a conflict on the usernames there [17:35:33] Not quite sure what I'm looking at there, but `Did you forget to add your User object to the database before calling addGroup()?` suggests you're right :) [17:36:03] (03PS5) 10Andrew Bogott: labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 (https://phabricator.wikimedia.org/T187506) [17:42:04] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frbast1001 - https://phabricator.wikimedia.org/T187363#3993564 (10Cmjohnson) @cwdent or @Jgreen I need to know which network switch you prefer for this server fasw-c1a or c1b? [17:42:35] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073#3993566 (10Cmjohnson) [17:42:46] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073#3932868 (10Cmjohnson) @cwdent or @Jgreen I need to know which network switch you prefer for this server fasw-c1a or c1b? 
[17:43:04] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#3993568 (10Cmjohnson) [17:43:11] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#3973195 (10Cmjohnson) @cwdent or @Jgreen I need to know which network switch you prefer for this server fasw-c1a or c1b? [17:43:29] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#3993572 (10Cmjohnson) [17:43:31] (03CR) 10Smalyshev: wdqs: allow configuration of kafka based updates (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/412873 (https://phabricator.wikimedia.org/T185951) (owner: 10Gehel) [17:43:41] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#3973216 (10Cmjohnson) @cwdent or @Jgreen I need to know which network switch you prefer for this server fasw-c1a or c1b? [17:43:51] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frbast1001 - https://phabricator.wikimedia.org/T187363#3993581 (10Cmjohnson) [17:44:14] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073#3993583 (10Cmjohnson) [17:44:29] (03CR) 10Andrew Bogott: [C: 031] "This all looks reasonable to me. I'd like someone more familiar with labs/db/replica.pp to sign off on that bit as well." [puppet] - 10https://gerrit.wikimedia.org/r/413375 (https://phabricator.wikimedia.org/T184704) (owner: 10Jcrespo) [17:54:21] (03PS6) 10Andrew Bogott: labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 (https://phabricator.wikimedia.org/T187506) [17:56:29] (03CR) 10Jcrespo: [C: 031] "Will add extra testing (puppet compiler) and deployed tomorrow." 
[puppet] - 10https://gerrit.wikimedia.org/r/413375 (https://phabricator.wikimedia.org/T184704) (owner: 10Jcrespo) [17:58:22] (03PS1) 10Jcrespo: install_server: Cleanup so no db server can be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/413423 (https://phabricator.wikimedia.org/T184704) [17:59:08] (03CR) 10Jcrespo: "Revert or patch as needed; I just wanted to reset to a "normal" state for a second." [puppet] - 10https://gerrit.wikimedia.org/r/413423 (https://phabricator.wikimedia.org/T184704) (owner: 10Jcrespo) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180222T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:21] no parsoid deploy today [18:00:52] (03PS2) 10Jcrespo: install_server: Cleanup so no db server can be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/413423 (https://phabricator.wikimedia.org/T184704) [18:01:13] (03CR) 10Jcrespo: [C: 032] install_server: Cleanup so no db server can be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/413423 (https://phabricator.wikimedia.org/T184704) (owner: 10Jcrespo) [18:03:40] (03PS7) 10Andrew Bogott: labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 (https://phabricator.wikimedia.org/T187506) [18:08:44] (03PS8) 10Andrew Bogott: labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 (https://phabricator.wikimedia.org/T187506) [18:13:58] (03Abandoned) 10Marostegui: install_server: Do not format any db in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/413415 (https://phabricator.wikimedia.org/T184704) (owner: 10Marostegui) [18:16:14] (03PS9) 10Andrew Bogott: labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 
(https://phabricator.wikimedia.org/T187506) [18:19:20] (03PS10) 10Andrew Bogott: labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 (https://phabricator.wikimedia.org/T187506) [18:21:58] (03CR) 10Andrew Bogott: [C: 032] labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [18:22:04] (03PS11) 10Andrew Bogott: labweb nutcracker: add non-functional redis host list [puppet] - 10https://gerrit.wikimedia.org/r/413413 (https://phabricator.wikimedia.org/T187506) [18:24:15] 10Operations, 10Discovery, 10Icinga, 10Maps, and 2 others: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#3993764 (10Gehel) I need to check, but I think the metric needed for OSM replication lag are still good. I'll just need to cherry pick this patch and... [18:31:06] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3993807 (10Dzahn) >>! In T185282#3992883, @Volker_E wrote: > We plan to have an index page in design.wikimedia.org and the style g... [18:32:44] (03PS4) 10Dzahn: introduce kafkamon1001/2001 [dns] - 10https://gerrit.wikimedia.org/r/413281 (https://phabricator.wikimedia.org/T187901) [18:32:51] (03CR) 10Muehlenhoff: "This looks good and seems ready for some testing on eventlog1002 IMO." [puppet] - 10https://gerrit.wikimedia.org/r/413362 (owner: 10Elukey) [18:33:33] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3993819 (10Volker_E) @Dzahn Ok, good to know. Basically, it needs to be two repos, while the one (style guide) is specifically tar... 
[18:35:16] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: New WDQS clusters eqiad + codfw - https://phabricator.wikimedia.org/T182991#3993829 (10Smalyshev) [18:37:07] 10Operations, 10ops-codfw: rack/setup/install wdqs200[4-6] - https://phabricator.wikimedia.org/T187800#3993848 (10RobH) [18:40:53] (03PS1) 10Ayounsi: DNS: redirect states from ulsfo to codfw [dns] - 10https://gerrit.wikimedia.org/r/413431 [18:41:14] 10Operations, 10Goal, 10HHVM: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#3993863 (10bd808) [18:41:41] 10Operations, 10ops-codfw: rack/setup/install wdqs200[4-6] - https://phabricator.wikimedia.org/T187800#3993865 (10RobH) a:05RobH>03Gehel So in looking at wdqs2001-2003, they will be under warranty/in service for year(s) yet. So your plan to space the new 3 systems evenly across the 4 rows with the existin... [18:46:45] (03CR) 10BBlack: [C: 031] DNS: redirect states from ulsfo to codfw [dns] - 10https://gerrit.wikimedia.org/r/413431 (owner: 10Ayounsi) [18:47:10] (03CR) 10Ayounsi: [C: 032] DNS: redirect states from ulsfo to codfw [dns] - 10https://gerrit.wikimedia.org/r/413431 (owner: 10Ayounsi) [18:49:18] 10Operations, 10ops-codfw: rack/setup/install wdqs200[4-6] - https://phabricator.wikimedia.org/T187800#3993886 (10Gehel) Those new systems are for a new cluster, independent of the current on (T178492). So we don't have any need to spread the failure domains across both the old and the new servers. So yes, the... 
[18:49:28] 10Operations, 10ops-codfw: rack/setup/install wdqs200[4-6] - https://phabricator.wikimedia.org/T187800#3993888 (10Gehel) a:05Gehel>03Papaul [18:51:21] (03PS3) 10Jcrespo: install_server: Cleanup so no db server can be reimaged [puppet] - 10https://gerrit.wikimedia.org/r/413423 (https://phabricator.wikimedia.org/T184704) [18:57:55] 10Operations, 10Puppet: naggen2: support puppetdb 4 settings and api - https://phabricator.wikimedia.org/T188032#3993941 (10herron) p:05Triage>03Normal [18:58:09] (03PS1) 10Herron: naggen2: support puppetdb 4 settings and api [puppet] - 10https://gerrit.wikimedia.org/r/413435 (https://phabricator.wikimedia.org/T188032) [19:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180222T1900). Please do the needful. [19:00:04] twkozlowski: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:00:17] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2073 (duration: 01m 12s) [19:00:31] * odder waves [19:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:56] note there is some enwiki breakage going on, SWAT should probably be on hold [19:02:01] (03PS2) 10Rush: openstack: labs-instance-transport1-b-codfw designations [dns] - 10https://gerrit.wikimedia.org/r/413160 (https://phabricator.wikimedia.org/T184209) [19:02:09] what tgr said [19:03:15] !log baham:~# authdns-update [19:03:20] sorry for the deploy, I got distracted because breakage, but I didn't want to leave something merged but undeployed [19:03:28] greg-g: I had this scheduled twice already, the first time round no one was there to do the SWAT, and I wasn't there the second time last week [19:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:46] greg-g: To be honest, I don't even need to be here, these are just simple swaps for the logos/favicons [19:04:14] So if you guys could just merge and deploy it whenever you've got time, it should be fine [19:04:28] s/it/them apples/ [19:04:37] <_joe_> yes, SWAT is on hold [19:04:49] <_joe_> I hereby declare :) [19:05:03] <_joe_> we have a serious breakage going on, not just on enwiki btw [19:05:06] (03PS1) 10Ayounsi: Revert "DNS: redirect states from ulsfo to codfw" [dns] - 10https://gerrit.wikimedia.org/r/413438 [19:05:07] _joe_: `scap lock` is your friend :) [19:05:13] If you want to enforce this [19:05:31] <_joe_> no_justification: no need I trust people will behave :) [19:05:44] <_joe_> chasemp: thanks [19:05:49] _joe_: well, not everyone checks in with IRC before deploying ;-) [19:05:51] np :) [19:05:56] (this is a problem anyway, but I digress) [19:06:16] <_joe_> no_justification: well anyone in a swat window should, right? 
[19:06:24] True true [19:06:31] (I'm thinking more rogue deploy of X service) [19:06:36] (hence `scap lock --global` [19:06:40] <_joe_> odder: tbh, we'll also need to do a full rolling restart of the appservers after the deploy (yes, it's that bad) [19:06:46] <_joe_> so it will take some time, sorry [19:06:47] <_joe_> :( [19:06:59] _joe_: No worries, would it be fine if I moved it to the evening window? [19:07:20] I won't be able to make it though due to the time difference [19:07:32] <_joe_> I'm the wrong person to ask about that :) [19:07:35] <_joe_> greg-g: ^^ [19:08:05] odder: Link to patch [19:08:21] https://gerrit.wikimedia.org/r/#/c/410201/ [19:08:25] I could maybe handle it depending on what you ask :) [19:08:31] https://gerrit.wikimedia.org/r/#/c/406624/ [19:08:45] These two are fairly easy, no_justification [19:08:52] Trivial [19:09:01] I'll get 'er dun [19:09:16] https://gerrit.wikimedia.org/r/#/c/402618/ is the one that rewrites all of the favicons we're currently serving [19:09:26] !log syncing https://gerrit.wikimedia.org/r/#/c/413437/ [19:09:34] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.21/extensions/SyntaxHighlight_GeSHi: deploy https://gerrit.wikimedia.org/r/#/c/413437/ (duration: 01m 13s) [19:09:34] (03PS1) 10Jcrespo: mariadb: Pool db2090 for the first time on s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413439 (https://phabricator.wikimedia.org/T170662) [19:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:43] Due to https://phabricator.wikimedia.org/T177726 no_justification [19:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:01] (03CR) 10Jcrespo: [C: 04-2] "To check and deploy tomorrow." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/413439 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [19:12:09] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db2073 and db2090 after reimage [puppet] - 10https://gerrit.wikimedia.org/r/413440 [19:12:36] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.22/extensions/SyntaxHighlight_GeSHi: deploy https://gerrit.wikimedia.org/r/#/c/413437/ (duration: 01m 13s) [19:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:41] (03PS13) 10Rush: shinken: WMCS: use sumSeries to reduce puppet failures false positves [puppet] - 10https://gerrit.wikimedia.org/r/411315 (https://phabricator.wikimedia.org/T161898) (owner: 10Chico Venancio) [19:13:52] (03CR) 10Jcrespo: [C: 032] mariadb: Reenable notifications on db2073 and db2090 after reimage [puppet] - 10https://gerrit.wikimedia.org/r/413440 (owner: 10Jcrespo) [19:14:30] thanks twentyafterfour [19:14:41] !log rolling restart of eqiad appservers. sudo cumin -b3 -s 30 'A:mw-eqiad' 'restart-hhvm' T188019 [19:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:55] T188019: English Wikipedia page gives lock timeout - https://phabricator.wikimedia.org/T188019 [19:15:13] greg-g: no problem! 
[19:17:27] (03CR) 10Rush: [C: 032] shinken: WMCS: use sumSeries to reduce puppet failures false positves [puppet] - 10https://gerrit.wikimedia.org/r/411315 (https://phabricator.wikimedia.org/T161898) (owner: 10Chico Venancio) [19:18:29] no_justification: So I'll leave those two with you, hope that's fine [19:18:52] (03CR) 10Rush: [V: 032 C: 032] shinken: WMCS: use sumSeries to reduce puppet failures false positves [puppet] - 10https://gerrit.wikimedia.org/r/411315 (https://phabricator.wikimedia.org/T161898) (owner: 10Chico Venancio) [19:19:01] For the third one, if you want to delay it, just let me know via a comment on Gerrit and I'll wait for a better time [19:24:21] (03PS14) 10Rush: shinken: WMCS: use sumSeries to reduce puppet failures false positves [puppet] - 10https://gerrit.wikimedia.org/r/411315 (https://phabricator.wikimedia.org/T161898) (owner: 10Chico Venancio) [19:29:34] (03PS1) 10Ayounsi: DNS: redirect FB's main source of queries from ulsfo to codfw [dns] - 10https://gerrit.wikimedia.org/r/413446 [19:30:05] (03CR) 10BBlack: [C: 031] DNS: redirect FB's main source of queries from ulsfo to codfw [dns] - 10https://gerrit.wikimedia.org/r/413446 (owner: 10Ayounsi) [19:32:08] (03CR) 10Ayounsi: [C: 032] DNS: redirect FB's main source of queries from ulsfo to codfw [dns] - 10https://gerrit.wikimedia.org/r/413446 (owner: 10Ayounsi) [19:33:21] !log redirecting Facebook bots large source of traffic to codfw ( https://gerrit.wikimedia.org/r/#/c/413446/ ) [19:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:03] (03Abandoned) 10BryanDavis: labsdb: Remove obsolete mediawiki-config submodule [puppet] - 10https://gerrit.wikimedia.org/r/413095 (https://phabricator.wikimedia.org/T187850) (owner: 10BryanDavis) [19:44:46] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 34.27, 34.14, 32.10 [19:44:56] PROBLEM - High CPU load on API appserver on mw1278 is CRITICAL: CRITICAL - load average: 
47.65, 42.46, 40.25 [19:45:03] no_justification: twentyafterfour akosiaris I presume the rolling restart is near done, can we resume normal activities? [19:45:20] I have no idea [19:48:47] PROBLEM - High CPU load on API appserver on mw1285 is CRITICAL: CRITICAL - load average: 41.69, 41.82, 40.12 [19:48:57] PROBLEM - High CPU load on API appserver on mw1278 is CRITICAL: CRITICAL - load average: 41.29, 40.96, 40.04 [19:50:16] PROBLEM - HHVM rendering on mw1258 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [19:50:57] PROBLEM - High CPU load on API appserver on mw1278 is CRITICAL: CRITICAL - load average: 40.62, 40.64, 40.01 [19:51:16] RECOVERY - HHVM rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 74745 bytes in 0.107 second response time [19:53:47] PROBLEM - High CPU load on API appserver on mw1285 is CRITICAL: CRITICAL - load average: 40.77, 40.74, 40.10 [19:56:57] PROBLEM - High CPU load on API appserver on mw1278 is CRITICAL: CRITICAL - load average: 42.77, 40.88, 40.18 [19:58:21] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3994134 (10Dzahn) @Volker_E just to be clear, 2 different repos that are both on Gerrit (from puppet's point of view). Whether you... [20:00:01] (03PS1) 10Rush: openstack: make nova compute kvm monitoring optional [puppet] - 10https://gerrit.wikimedia.org/r/413452 (https://phabricator.wikimedia.org/T187292) [20:00:04] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180222T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. 
[20:00:18] (03CR) 10jerkins-bot: [V: 04-1] openstack: make nova compute kvm monitoring optional [puppet] - 10https://gerrit.wikimedia.org/r/413452 (https://phabricator.wikimedia.org/T187292) (owner: 10Rush) [20:01:20] (03PS2) 10Rush: openstack: make nova compute kvm monitoring optional [puppet] - 10https://gerrit.wikimedia.org/r/413452 (https://phabricator.wikimedia.org/T187292) [20:01:49] (03CR) 10Rush: [C: 032] openstack: make nova compute kvm monitoring optional [puppet] - 10https://gerrit.wikimedia.org/r/413452 (https://phabricator.wikimedia.org/T187292) (owner: 10Rush) [20:01:56] PROBLEM - High CPU load on API appserver on mw1285 is CRITICAL: CRITICAL - load average: 42.55, 40.68, 40.12 [20:02:31] greg-g: it's at 74% [20:07:36] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:10:38] akosiaris: well then.... [20:10:48] oh, I read that as 7%, 74% is much better :) [20:10:57] it's 94% now [20:11:00] basically 3 hosts left [20:11:10] give it 2-3 mins and you should be a go [20:12:17] PROBLEM - Apache HTTP on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [20:12:56] mutante: so who is my next person to talk to? [20:13:17] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.184 second response time [20:15:40] greg-g: you are good to go [20:16:05] thanks! 
[20:16:51] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 10.63, 14.06, 23.48 [20:18:00] RECOVERY - High CPU load on API appserver on mw1285 is OK: OK - load average: 12.35, 18.83, 29.47 [20:18:03] (03PS1) 10Ayounsi: DNS: Redirect all FB prefixes learned in ulsfo to codfw [dns] - 10https://gerrit.wikimedia.org/r/413459 [20:18:53] (03CR) 10BBlack: [C: 031] DNS: Redirect all FB prefixes learned in ulsfo to codfw [dns] - 10https://gerrit.wikimedia.org/r/413459 (owner: 10Ayounsi) [20:22:32] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:23:24] (03CR) 10Ayounsi: [C: 032] DNS: Redirect all FB prefixes learned in ulsfo to codfw [dns] - 10https://gerrit.wikimedia.org/r/413459 (owner: 10Ayounsi) [20:24:15] ottomata: RE: udp2log, that's rarely cross-dc though, right? [20:24:33] I mean, mediawiki aside from the switchovers is in the same dc as flourine/mwlog servers [20:24:42] old days it was lots cross DC [20:24:57] but, the 4% loss was caused more because of data volume [20:24:59] not network lossiness [20:25:09] ottomata: Which data was cross-dc though? [20:25:09] (03PS3) 10ArielGlenn: restbase dumps in xml format [dumps] - 10https://gerrit.wikimedia.org/r/413212 [20:25:10] server network buffers would fill up before udp2log could process them [20:25:21] Krinkle: all the webrequest data used to be over udp2log [20:25:29] so, all esams logs -> eqiad over udp [20:25:35] ottomata: varnish/squid logs? 
[20:25:37] yup [20:25:39] Ah okay [20:25:44] I didn't realise that also used udp2log [20:25:50] yeah, long ago :) [20:25:53] I'm thinking mediawiki php log() messages [20:26:01] Cool [20:26:05] Thanks for that nugget of history :) [20:26:10] i think udp2log was originally written for webrequest logs...might not have been [20:27:17] (03PS1) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413461 (https://phabricator.wikimedia.org/T188034) [20:27:32] ottomata: I bet Tim knows. I think it was his work originally [20:28:40] <_joe_> him or domas? I don't remember [20:29:39] ya [20:29:40] it was tims [20:29:53] (03PS2) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413461 (https://phabricator.wikimedia.org/T188034) [20:30:02] domas wrote webstatscollector, the orignally pageview aggregator, which was fed from udp2log [20:30:09] original* [20:30:12] RECOVERY - High CPU load on API appserver on mw1278 is OK: OK - load average: 8.99, 16.12, 29.05 [20:32:30] https://phabricator.wikimedia.org/rSVN19361 [20:32:34] renames 'squid-log' to 'udplog' [20:33:16] ottomata: :) [20:39:16] :) [20:39:19] !log demon@tin Synchronized php-1.31.0-wmf.22/includes/libs/objectcache/WANObjectCache.php: betterer logging for cache ttl reduction, Iea029e78 (duration: 01m 13s) [20:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:56] (03CR) 10BBlack: [C: 031] Revert "DNS: redirect states from ulsfo to codfw" [dns] - 10https://gerrit.wikimedia.org/r/413438 (owner: 10Ayounsi) [20:44:21] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1948 bytes in 0.138 second response time [20:49:21] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1933 bytes in 0.082 second response time [20:50:21] (03PS5) 10Dzahn: introduce kafkamon1001/2001 [dns] - 
10https://gerrit.wikimedia.org/r/413281 (https://phabricator.wikimedia.org/T187901) [20:52:41] (03CR) 10Dzahn: [C: 032] introduce kafkamon1001/2001 [dns] - 10https://gerrit.wikimedia.org/r/413281 (https://phabricator.wikimedia.org/T187901) (owner: 10Dzahn) [20:54:09] (03CR) 10Dzahn: "[radon:~] $ host kafkamon1001.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/413283 (https://phabricator.wikimedia.org/T187901) (owner: 10Dzahn) [20:54:42] (03CR) 10Dzahn: "[radon:~] $ host 10.64.0.129" [puppet] - 10https://gerrit.wikimedia.org/r/413283 (https://phabricator.wikimedia.org/T187901) (owner: 10Dzahn) [20:55:29] (03CR) 10Dzahn: partman: add kafkamon[1-2]00[0-9] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/413283 (https://phabricator.wikimedia.org/T187901) (owner: 10Dzahn) [20:57:46] (03PS2) 10Dzahn: partman: add kafkamon[1-2]00[0-9] [puppet] - 10https://gerrit.wikimedia.org/r/413283 (https://phabricator.wikimedia.org/T187901) [20:57:51] PROBLEM - MD RAID on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [20:57:52] PROBLEM - Updater process on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [20:58:01] PROBLEM - Blazegraph Port on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [20:58:02] PROBLEM - WDQS HTTP on wdqs1004 is CRITICAL: connect to address 10.64.0.17 and port 80: Connection refused [20:58:03] PROBLEM - Blazegraph process on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [20:58:11] PROBLEM - configured eth on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [20:58:12] PROBLEM - WDQS HTTP Port on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [20:58:14] (03PS1) 10Rush: openstack: nova active kvm check to alert multiple groups [puppet] - 10https://gerrit.wikimedia.org/r/413468 (https://phabricator.wikimedia.org/T187292) [20:58:21] PROBLEM - Check size of conntrack table on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [20:58:21] PROBLEM - DPKG on wdqs1004 is CRITICAL: 
Return code of 255 is out of bounds [20:58:34] (03PS2) 10Rush: openstack: nova active kvm check to alert multiple groups [puppet] - 10https://gerrit.wikimedia.org/r/413468 (https://phabricator.wikimedia.org/T187292) [20:59:30] (03CR) 10Rush: [C: 032] openstack: nova active kvm check to alert multiple groups [puppet] - 10https://gerrit.wikimedia.org/r/413468 (https://phabricator.wikimedia.org/T187292) (owner: 10Rush) [20:59:44] Did wdqs1004 just crash? I'm starting the laptop, will check in a minute... [21:00:22] PROBLEM - Check whether ferm is active by checking the default input chain on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:00:22] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: connect to address 10.64.0.17 and port 80: Connection refused [21:00:32] PROBLEM - Disk space on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:00:35] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:00:35] PROBLEM - dhclient process on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:00:41] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:00:47] (03CR) 10Dzahn: [C: 032] "added "kafkamon" to https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Servers" [dns] - 10https://gerrit.wikimedia.org/r/413281 (https://phabricator.wikimedia.org/T187901) (owner: 10Dzahn) [21:00:48] I killed some old ports in row A...maybe it was mislabeled [21:00:51] PROBLEM - SSH on wdqs1004 is CRITICAL: connect to address 10.64.0.17 and port 22: Connection refused [21:01:01] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:01:18] looking for wdqs1004 [21:02:11] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:02:11] PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:21] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:32] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:41] puppetdb [21:02:41] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:47] mutante it's puppetdb [21:02:51] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:51] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:03:00] wdqs1004: can't connect nitrogen: puppet run ok [21:03:02] mutante: I'm back online... did you already find something for wdqs1004? [21:03:05] paladox: 2 separate things [21:03:12] ok [21:03:19] gehel: no, i can't connect, and not to mgmt either .. unless i'm doing it wrong [21:03:22] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:03:31] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:03:31] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:03:32] PROBLEM - puppet last run on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:03:32] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:03:32] paladox: i tested on one of them, they work [21:03:39] ok :) [21:03:41] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:03:51] mutante: I'm connected to mgmt, checking the console [21:03:56] great! [21:04:07] mutante: it is rebooting... [21:04:08] i'm checking puppet runs on another, conf1003.. no issue [21:04:11] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:13] gehel: ooh.. ok [21:04:21] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:21] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:22] actually no, [21:04:31] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:31] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:31] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:04:37] everything is in row A ....I think it was me...XioNox can you rollback on asw-a please [21:05:03] it seems like it was just down for a short moment [21:05:07] but is ok now [21:05:46] also all the "puppetmaster" alerts should be unrelated to what happened to wdqs1004 [21:06:03] which has other issues, like hardware [21:06:19] (03PS1) 10Bstorm: tools-static: Remove problematic headers from proxy responses [puppet] - 10https://gerrit.wikimedia.org/r/413469 (https://phabricator.wikimedia.org/T182604) [21:06:58] i am expecting most of those to recover in a moment, except wdqs1004 [21:07:11] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:07:21] PROBLEM - Host wdqs1004 is DOWN: PING CRITICAL - Packet loss = 100% [21:07:35] ^ that [21:07:50] We rolled back any changes I made on the switch so I don't think it's related to that [21:08:04] I'm on the console on wdqs1004, it looks reasonably well except that I can't reach the network [21:08:31] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:08:47] cmjohnson1: if it was at all.. it was just for a moment.. a moment is enough for the alert spam and then it takes icinga at least 5 minutes to get over it even if the real issue was for 2 seconds [21:08:53] Is there a specific ldap group (or other) that I need to be in to access the machine oxygen.eqiad.wmnet? [21:09:19] marlier: LDAP groups are used for web-based logins but not for shell access. 
for that there are admin groups in the puppet repo [21:09:21] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:10:02] gehel: try powercycling first [21:10:24] mutante: I was going to ask if you had any better idea than powercycling :) [21:10:28] ok, on it [21:11:37] !log powercycling wdqs1004 (complete loss of network) [21:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:01] (03CR) 10Eevans: [C: 031] "> Is it that easy???" [puppet] - 10https://gerrit.wikimedia.org/r/413405 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [21:12:30] marlier: you would need to create a phab ticket please and add the tag "Ops-Access-Requests" if you want to request shell access [21:13:22] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:13:31] RECOVERY - Check size of conntrack table on wdqs1004 is OK: OK: nf_conntrack is 0 % full [21:13:31] RECOVERY - Host wdqs1004 is UP: PING WARNING - Packet loss = 64%, RTA = 0.22 ms [21:13:32] RECOVERY - Check whether ferm is active by checking the default input chain on wdqs1004 is OK: OK ferm input default policy is set [21:13:32] RECOVERY - DPKG on wdqs1004 is OK: All packages OK [21:13:33] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 15990 bytes in 0.001 second response time [21:13:38] :) [21:13:42] wdqs is coming back online cleanly it seems... 
[21:13:51] RECOVERY - Disk space on wdqs1004 is OK: DISK OK [21:13:52] RECOVERY - dhclient process on wdqs1004 is OK: PROCS OK: 0 processes with command name dhclient [21:13:52] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational [21:13:54] RECOVERY - SSH on wdqs1004 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [21:14:01] RECOVERY - MD RAID on wdqs1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:14:02] RECOVERY - Updater process on wdqs1004 is OK: PROCS OK: 1 process with UID = 997 (blazegraph), regex args ^java .* org.wikidata.query.rdf.tool.Update [21:14:03] mutante: any idea how to investigate that (or anyone else) [21:14:11] RECOVERY - Blazegraph Port on wdqs1004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 [21:14:12] RECOVERY - WDQS HTTP on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 15990 bytes in 0.001 second response time [21:14:21] RECOVERY - Blazegraph process on wdqs1004 is OK: PROCS OK: 1 process with UID = 997 (blazegraph), regex args ^java .* blazegraph-service-.*war [21:14:22] RECOVERY - configured eth on wdqs1004 is OK: OK - interfaces up [21:14:31] RECOVERY - WDQS HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 435 bytes in 0.031 second response time [21:15:12] it is close to bed time here... so I'm not going to dig into this now, but I'm happy for any suggestion on how to investigate tomorrow... [21:15:35] gehel: well.. /var/log/syslog to see if it just ended randomly or there was anything happening before.. but if it just randomly freezes in the middle of normal work.. it happens and then we can just document it and see which hardware type it is.. does it always happen to a specific type.. etc [21:15:53] looking [21:16:04] mutante: thanks! 
[21:16:07] go to bed :) [21:16:25] 10Operations, 10Ops-Access-Requests: Requesting shell access sufficient to access oxygen.eqiad.wmnet - https://phabricator.wikimedia.org/T188042#3994287 (10Imarlier) [21:16:31] mutante: greg-g it seems things from earlier are resolved? should I fixup the topic here to say a-ok? [21:16:32] i can make at least a paste bin [21:16:46] 10Operations, 10Ops-Access-Requests: Requesting shell access sufficient to access oxygen.eqiad.wmnet - https://phabricator.wikimedia.org/T188042#3994300 (10Imarlier) [21:16:56] chasemp: yeppers [21:16:59] chasemp: i don't know since the topic thing was unrelated to this [21:17:08] but yea. afaict [21:17:34] (that's what it was before) [21:19:08] I'm about to try to trigger an alert to make sure a specific check works in relation to running instances on labvirt1018 [21:19:57] alright, thanks for the heads-up [21:20:56] PROBLEM - ensure kvm processes are running on labvirt1018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm [21:21:16] that's me^^^^ normally errant but this is a test [21:21:56] RECOVERY - ensure kvm processes are running on labvirt1018 is OK: PROCS OK: 1 process with regex args /usr/bin/kvm [21:23:15] mutante: any idea what caused that wdqs1004 issue? 
[21:23:45] (03CR) 10Ayounsi: [C: 032] Revert "DNS: redirect states from ulsfo to codfw" [dns] - 10https://gerrit.wikimedia.org/r/413438 (owner: 10Ayounsi) [21:23:49] (03PS2) 10Ayounsi: Revert "DNS: redirect states from ulsfo to codfw" [dns] - 10https://gerrit.wikimedia.org/r/413438 [21:24:08] XioNoX: i am looking right now, not yet [21:24:19] (03CR) 10BryanDavis: tools-static: Remove problematic headers from proxy responses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/413469 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm) [21:24:40] but it's not related to the puppetmaster thing [21:28:23] mutante: tomorrow I will verify the network switch port [21:28:25] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [21:28:25] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [21:28:44] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [21:29:14] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [21:29:24] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [21:29:34] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [21:29:34] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:29:34] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:30:34] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:31:04] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:32:05] RECOVERY - 
puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:32:44] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:32:44] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:32:44] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:32:54] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:32:54] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:34:34] PROBLEM - configured eth on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:34:35] PROBLEM - WDQS HTTP Port on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:34:36] PROBLEM - Check size of conntrack table on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:34:44] PROBLEM - Check whether ferm is active by checking the default input chain on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:34:44] PROBLEM - DPKG on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:34:47] :| [21:34:48] XioNoX: there are java errors but all that stuff seems like red herrings and normal before the incident as well.. all i can really see is it ... stopped working [21:34:51] and while i type that [21:34:53] it dies again [21:35:01] hardware.. [21:35:15] it still replies to pings [21:35:31] mutante: I was going to ping you, but I see that you have things well in hands... 
:) [21:35:37] it doesnt like ssh anymore [21:35:54] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: connect to address 10.64.0.17 and port 80: Connection refused [21:36:04] PROBLEM - Disk space on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:36:06] PROBLEM - dhclient process on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:36:06] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:36:14] PROBLEM - SSH on wdqs1004 is CRITICAL: connect to address 10.64.0.17 and port 22: Connection refused [21:36:15] PROBLEM - MD RAID on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:36:24] PROBLEM - Updater process on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:36:25] PROBLEM - WDQS HTTP on wdqs1004 is CRITICAL: connect to address 10.64.0.17 and port 80: Connection refused [21:36:26] PROBLEM - Blazegraph Port on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:36:34] PROBLEM - Blazegraph process on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:36:45] PROBLEM - puppet last run on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:36:54] Network interface Carrier transitions: 649, there is definitively something wrong with that host [21:37:00] (03PS2) 10Bstorm: tools-static: Remove problematic headers from proxy responses [puppet] - 10https://gerrit.wikimedia.org/r/413469 (https://phabricator.wikimedia.org/T182604) [21:37:08] !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1004.eqiad.wmnet [21:37:13] depooled [21:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:44] (03CR) 10Bstorm: "Added :)" [puppet] - 10https://gerrit.wikimedia.org/r/413469 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm) [21:37:54] RECOVERY - Check whether ferm is active by checking the default input chain on wdqs1004 is OK: OK ferm input default policy is set [21:37:54] RECOVERY - DPKG 
on wdqs1004 is OK: All packages OK [21:37:55] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 15990 bytes in 0.001 second response time [21:38:04] RECOVERY - Disk space on wdqs1004 is OK: DISK OK [21:38:05] RECOVERY - dhclient process on wdqs1004 is OK: PROCS OK: 0 processes with command name dhclient [21:38:06] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational [21:38:14] RECOVERY - SSH on wdqs1004 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [21:38:24] RECOVERY - MD RAID on wdqs1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:38:25] RECOVERY - Updater process on wdqs1004 is OK: PROCS OK: 1 process with UID = 997 (blazegraph), regex args ^java .* org.wikidata.query.rdf.tool.Update [21:38:26] RECOVERY - WDQS HTTP on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 15990 bytes in 0.001 second response time [21:38:34] RECOVERY - Blazegraph Port on wdqs1004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 [21:38:44] RECOVERY - Blazegraph process on wdqs1004 is OK: PROCS OK: 1 process with UID = 997 (blazegraph), regex args ^java .* blazegraph-service-.*war [21:38:45] RECOVERY - configured eth on wdqs1004 is OK: OK - interfaces up [21:38:54] RECOVERY - WDQS HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 435 bytes in 0.016 second response time [21:38:54] RECOVERY - Check size of conntrack table on wdqs1004 is OK: OK: nf_conntrack is 0 % full [21:39:29] ... [21:40:26] T188019 got security'd? 
[21:40:54] PROBLEM - Check whether ferm is active by checking the default input chain on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:40:54] PROBLEM - DPKG on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:41:04] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: connect to address 10.64.0.17 and port 80: Connection refused [21:41:08] TheresNoTime: custom policy, yea [21:41:09] <_joe_> yes [21:41:11] yes [21:41:14] PROBLEM - Disk space on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:41:14] PROBLEM - dhclient process on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:41:15] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: Return code of 255 is out of bounds [21:41:16] PROBLEM - SSH on wdqs1004 is CRITICAL: connect to address 10.64.0.17 and port 22: Connection refused [21:41:17] <_joe_> for a good reason tbh [21:41:29] <_joe_> can someone silence wdqs1004 please? [21:41:37] IDK it was a security issue when I first reported it [21:42:02] darn, fair enough, any sort of summary available? [21:42:03] <_joe_> Hauskatze: no one did [21:42:04] <_joe_> :P [21:42:12] okay so I feel a bit better [21:42:25] 10Operations: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#3994382 (10Dzahn) [21:42:33] _joe_: i am [21:43:19] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Services (watching): Choose a deploy server for the chromium-render service - https://phabricator.wikimedia.org/T187821#3994402 (10Niedzielski) @ovasileva, this was discussed in today's Services sync. There are a couple options @mobrovac and @Pchelolo are th... [21:44:06] must have been fairly 'interesting' as I was unsubbed after making comments :/ [21:44:27] (03PS1) 10Rush: openstack: monitoring change for nova-network and conntrack [puppet] - 10https://gerrit.wikimedia.org/r/413481 (https://phabricator.wikimedia.org/T178405) [21:44:32] The secret phab cabal! [21:44:37] what's going on with wdqs4? 
[21:44:48] (03CR) 10jerkins-bot: [V: 04-1] openstack: monitoring change for nova-network and conntrack [puppet] - 10https://gerrit.wikimedia.org/r/413481 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [21:44:54] looks like it's dead... [21:44:58] SMalyshev: hardware failure and depooled [21:45:06] 10Operations, 10ops-eqiad, 10DBA: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#3994435 (10Cmjohnson) A ticket has been created with Dell . You have successfully submitted request SR961176970. [21:45:07] afaict [21:45:16] (03PS2) 10Rush: openstack: monitoring change for nova-network and conntrack [puppet] - 10https://gerrit.wikimedia.org/r/413481 (https://phabricator.wikimedia.org/T178405) [21:45:23] https://phabricator.wikimedia.org/T188045 [21:45:29] bawolff: murr :P [21:45:45] (03CR) 10jerkins-bot: [V: 04-1] openstack: monitoring change for nova-network and conntrack [puppet] - 10https://gerrit.wikimedia.org/r/413481 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [21:45:46] mutante: ohh. is it fixable or it will need to be replaced? [21:46:03] TheresNoTime: Honestly, nothing super interesting. I think the non-security people got unsubbed primarily because some of them were being a little annoying with unhelpful suggestions ;) [21:46:17] * TheresNoTime takes the hint [21:46:18] SMalyshev: no idea, i depooled it a minute ago. it seems like it needs a new NIC or board so far [21:46:19] ;P [21:46:23] TheresNoTime: Not you [21:46:27] <_joe_> TheresNoTime: not you :P [21:46:28] TheresNoTime: not you [21:46:30] lol [21:46:31] <_joe_> ahahah [21:46:35] mutante: ok, will wait for updates [21:46:36] well that's not suspicious! 
[21:46:38] thanks [21:46:41] (03PS3) 10Rush: openstack: monitoring change for nova-network and conntrack [puppet] - 10https://gerrit.wikimedia.org/r/413481 (https://phabricator.wikimedia.org/T178405) [21:46:51] <_joe_> I swear we didn't coordinate that [21:46:56] mhmhm [21:47:02] (03CR) 10jerkins-bot: [V: 04-1] openstack: monitoring change for nova-network and conntrack [puppet] - 10https://gerrit.wikimedia.org/r/413481 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [21:47:04] <_joe_> pinkie swear [21:47:09] I do swear we didn't coordinate [21:47:39] TheresNoTime: But quick summary, we are still investigating the issue, and we're not sure of the cause [21:48:20] sure thing ^^ I had no idea what I was looking at anyway :) [21:49:06] And in the course of the investigation, we found an unrelated bug that's a more serious security issue [21:49:08] (03PS5) 10Rush: openstack: monitoring change for nova-network and conntrack [puppet] - 10https://gerrit.wikimedia.org/r/413481 (https://phabricator.wikimedia.org/T178405) [21:50:53] 10Operations: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#3994453 (10Dzahn) {F13956494} 16:07 < icinga-wm> PROBLEM - Host wdqs1004 is DOWN: PING CRITICAL - Packet loss = 100% 16:08 < gehel> I'm on the console on wdqs1004, it looks reasonnably well except that I cant reach network 16:11 < gehel>... [21:51:08] !log demon@tin Synchronized php-1.31.0-wmf.22/includes/externalstore/: I9334d36e (duration: 01m 15s) [21:51:09] will there be a write up at the end once its patched? 
Things like this do fascinate me, even if I don't fully understand it [21:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:55] (03CR) 10Rush: [C: 032] openstack: monitoring change for nova-network and conntrack [puppet] - 10https://gerrit.wikimedia.org/r/413481 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [21:51:57] TheresNoTime: Once issue is fixed, we will make the bug public again [21:52:12] 10Operations: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#3994382 (10Dzahn) 16:34 < mutante> XioNoX: there are java errors but all that stuff seems like red herrings and normal before the incident as well.. all i can really see is it ... stopped working 16:36 < XioNoX> Network interface Carrier... [21:52:18] (03PS1) 10Gehel: wdqs: move LDF endpoint to wdqs1005 since wdqs1004 is having hardware issues [puppet] - 10https://gerrit.wikimedia.org/r/413485 (https://phabricator.wikimedia.org/T188045) [21:52:28] Awesome! :) Well good luck! 
[21:52:35] TheresNoTime: We usually try to always make security bugs eventually be public (Occasionally they might have personally identifiable info where we can't, but that's pretty rare) [21:52:52] (03CR) 10Smalyshev: [C: 031] wdqs: move LDF endpoint to wdqs1005 since wdqs1004 is having hardware issues [puppet] - 10https://gerrit.wikimedia.org/r/413485 (https://phabricator.wikimedia.org/T188045) (owner: 10Gehel) [21:52:53] If there's ever a bug that's marked private, but you think should be public because it's resolved, feel free to ask me about it [21:52:57] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#3994467 (10Dzahn) [21:53:21] mutante: just remembered we still have a non redundant service on wdqs1004 : https://gerrit.wikimedia.org/r/#/c/413485/ [21:53:24] or hack in bawolff account and say that you hacked the hacker :P [21:53:37] Hauskatze: yeah, that's one approach [21:53:40] mutante: could I ask you to merge and check on it [21:53:59] * gehel is already almost sleeping and does not trust himself with changing production [21:54:07] gehel: ok [21:54:20] mutante: thanks a lot! Beers on me next time! [21:54:49] bawolff: we all know your password is 'I'm Canadian, eh?' 
[21:55:05] (03PS2) 10Dzahn: wdqs: move LDF endpoint to wdqs1005 since wdqs1004 is having hardware issues [puppet] - 10https://gerrit.wikimedia.org/r/413485 (https://phabricator.wikimedia.org/T188045) (owner: 10Gehel) [21:55:14] Hauskatze: needs more apologies [21:55:59] TheresNoTime: don't understand that, sorry [21:56:00] Hauskatze: nah, its 'dolphin' (super secure ever since https://github.com/danielmiessler/SecLists/pull/155 ) [21:56:50] (03CR) 10Dzahn: [C: 032] wdqs: move LDF endpoint to wdqs1005 since wdqs1004 is having hardware issues [puppet] - 10https://gerrit.wikimedia.org/r/413485 (https://phabricator.wikimedia.org/T188045) (owner: 10Gehel) [21:57:19] actually bawolff, since you're around, any thoughts on https://phabricator.wikimedia.org/T121186 ? [21:57:24] bawolff: goodness... [21:57:35] (namely comment https://phabricator.wikimedia.org/T121186#3861773) [21:57:55] everybody knows it's always hunter2 [21:58:08] all I see is ******* ? [21:59:45] MaxSem: yeah, that was my password prior to dolphins [21:59:58] SMalyshev: i merged that and now running puppet on all wdqs* [22:00:21] thanks! [22:00:34] (03PS1) 10Rush: openstack: nova-fullstack alert wmcs-team [puppet] - 10https://gerrit.wikimedia.org/r/413486 (https://phabricator.wikimedia.org/T178405) [22:00:41] legoktm: is local usurpation an option for T188014 ? [22:00:41] T188014: MassMessage doesn't work on dty.wikipedia - https://phabricator.wikimedia.org/T188014 [22:01:04] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova-fullstack alert wmcs-team [puppet] - 10https://gerrit.wikimedia.org/r/413486 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [22:01:15] done.. 
though unexpectedly i get dropped to Cumin Interactive REPL something [22:01:25] Hauskatze: I read through that ticket, decided it was lower priority than the other one and haven't looked at it yet :) [22:01:56] but success on the 5 remaining ones [22:02:00] legoktm: okay :) [22:02:41] (03PS2) 10Rush: openstack: nova-fullstack alert wmcs-team [puppet] - 10https://gerrit.wikimedia.org/r/413486 (https://phabricator.wikimedia.org/T178405) [22:02:49] (03CR) 10BryanDavis: [C: 031] tools-static: Remove problematic headers from proxy responses [puppet] - 10https://gerrit.wikimedia.org/r/413469 (https://phabricator.wikimedia.org/T182604) (owner: 10Bstorm) [22:03:21] (03PS3) 10Rush: openstack: nova-fullstack alert wmcs-team [puppet] - 10https://gerrit.wikimedia.org/r/413486 (https://phabricator.wikimedia.org/T178405) [22:04:06] (03CR) 10Rush: [C: 032] "irc +1 from andrew :)" [puppet] - 10https://gerrit.wikimedia.org/r/413486 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [22:04:55] (03PS3) 10Bstorm: tools-static: Remove problematic headers from proxy responses [puppet] - 10https://gerrit.wikimedia.org/r/413469 (https://phabricator.wikimedia.org/T182604) [22:08:59] (03PS1) 10Andrew Bogott: labweb: create /var/run/nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/413490 (https://phabricator.wikimedia.org/T187506) [22:09:22] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Services (watching): Choose a server for the chromium-render service - https://phabricator.wikimedia.org/T187821#3994550 (10Legoktm) [22:09:41] (03CR) 10jerkins-bot: [V: 04-1] labweb: create /var/run/nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/413490 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [22:11:14] (03PS2) 10Andrew Bogott: labweb: create /var/run/nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/413490 (https://phabricator.wikimedia.org/T187506) [22:12:11] (03CR) 10Andrew Bogott: [C: 032] labweb: create /var/run/nutcracker [puppet] - 
10https://gerrit.wikimedia.org/r/413490 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [22:13:36] sigh 469 data error in /srv/mediawiki/php-1.31.0-wmf.21/extensions/Graph/includes/ApiGraph.php on line 125 [22:14:11] !log maxsem@tin Synchronized php-1.31.0-wmf.22/extensions/SyntaxHighlight_GeSHi/: T188019 (duration: 01m 14s) [22:14:23] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#3994582 (10Dzahn) i ran puppet on all wdqs* via cumin after the merge and smalyshev confirmed the LDF server seems fine https://query.wikidata.org/bigdata/ldf [22:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:56] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#3994589 (10Dzahn) @Papaul This server has been depooled. Could you please run hardware diagnostics on it? You can do it anytime. Thanks! 
[22:15:20] (03PS1) 10Rush: openstack: glance monitoring main should alert wmcs-team [puppet] - 10https://gerrit.wikimedia.org/r/413491 (https://phabricator.wikimedia.org/T178405) [22:16:16] !log maxsem@tin Synchronized php-1.31.0-wmf.21/extensions/SyntaxHighlight_GeSHi/: T188019 (duration: 01m 12s) [22:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:14] (03PS2) 10Rush: openstack: glance monitoring main should alert wmcs-team [puppet] - 10https://gerrit.wikimedia.org/r/413491 (https://phabricator.wikimedia.org/T178405) [22:17:17] ugh Canary error check failed for 2 canaries, less than threshold to halt deployment (2/11) [22:18:45] (03CR) 10Rush: [C: 032] openstack: glance monitoring main should alert wmcs-team [puppet] - 10https://gerrit.wikimedia.org/r/413491 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [22:18:55] hrm usually that logs to irc [22:19:55] that's Linter, btw: Warning: Unable to record MySQL stats with: EXPLAIN /* MediaWiki\Linter\Database::getTotalsEstimate */ SELECT * FROM `linter` WHERE linter_cat = '8' in /srv/mediawiki/php-1.31.0-wmf.21/includes/libs/rdbms/database/DatabaseMysqli.php on line 47 [22:21:49] looking at the canary dashboard in logstash it looks to have tapered off. 
unless you've rolled back, it was probably an anomaly that coincided with deployment [22:24:06] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#3994636 (10Dzahn) [22:26:42] MaxSem: "Unable to record MySQL stats" is an olddddddddddd bug [22:26:48] HHVM claims it's fixed [22:26:51] (it clearly isn't) [22:26:59] I'm not sure what about Linter has caused the rise in them lately tho [22:27:04] It's harmless, fwiw [22:30:15] (03PS1) 10Andrew Bogott: horizon: move caching to nutcracker, port 11212 [puppet] - 10https://gerrit.wikimedia.org/r/413625 [22:30:18] (03Abandoned) 10Andrew Bogott: labweb nutcracker: further attempt to pass in memcached_pools correctly [puppet] - 10https://gerrit.wikimedia.org/r/413396 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [22:30:59] !log demon@tin Synchronized php-1.31.0-wmf.22/includes/Storage/: Id5cdd8ec (duration: 01m 13s) [22:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:46] (03PS2) 10Andrew Bogott: horizon: move caching to nutcracker, port 11212 [puppet] - 10https://gerrit.wikimedia.org/r/413625 [22:32:12] !log demon@tin Synchronized php-1.31.0-wmf.22/includes/externalstore/: Id5cdd8ec (duration: 01m 12s) [22:32:17] (03CR) 10Andrew Bogott: [C: 032] horizon: move caching to nutcracker, port 11212 [puppet] - 10https://gerrit.wikimedia.org/r/413625 (owner: 10Andrew Bogott) [22:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:25] !log demon@tin Synchronized php-1.31.0-wmf.22/includes/filerepo/file/LocalFile.php: Id5cdd8ec (duration: 01m 12s) [22:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:49] Hauskatze: wait - you added yourself as a subscriber to https://phabricator.wikimedia.org/T188019 ? 
[22:34:59] I don't object, I just didn't think that was possible [22:35:20] bawolff: yes I did, as task author however, it was redundant [22:35:21] I guess because you are the author of the task [22:35:37] that it is [22:35:49] interesting quirks of phabricator [22:36:20] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3994681 (10katielin) Is the NDA the "Acknowledgement of Confidential Information" document? I... [22:37:01] (03PS1) 10Rush: openstack: pass critical from deployment [puppet] - 10https://gerrit.wikimedia.org/r/413626 (https://phabricator.wikimedia.org/T178405) [22:37:01] let me know if I shouldn't be there, in any case, I won't say anything as I do on the other private tasks I'm subscribed [22:38:46] (03CR) 10Rush: [C: 032] openstack: pass critical from deployment [puppet] - 10https://gerrit.wikimedia.org/r/413626 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [22:38:52] (03PS2) 10Rush: openstack: pass critical from deployment [puppet] - 10https://gerrit.wikimedia.org/r/413626 (https://phabricator.wikimedia.org/T178405) [22:41:11] * TheresNoTime slips Hauskatze a $20 [22:41:40] thanks for the comments on the audit suggestions btw bawolff [22:47:29] Hauskatze: I'm not worried about you being in there. 
I consider you to be a trusted person [22:47:56] bawolff: thank you, that's very kind [22:52:35] (03PS2) 10Madhuvishy: nfs-mount-manager: Add option to kill process accessing a mount [puppet] - 10https://gerrit.wikimedia.org/r/408864 (https://phabricator.wikimedia.org/T171540) [22:54:09] (03PS1) 10Madhuvishy: nfs traffic_shaping: Add labstore1006|7 to tc setup [puppet] - 10https://gerrit.wikimedia.org/r/413631 [22:56:34] (03CR) 10Rush: [C: 04-1] nfs traffic_shaping: Add labstore1006|7 to tc setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/413631 (owner: 10Madhuvishy) [22:59:01] (03PS1) 10Chico Venancio: shinken: WMCS: fix Sum of puppet failures service deffinition [puppet] - 10https://gerrit.wikimedia.org/r/413632 (https://phabricator.wikimedia.org/T161898) [23:00:15] (03PS2) 10Madhuvishy: nfs traffic_shaping: Add labstore1006|7 to tc setup [puppet] - 10https://gerrit.wikimedia.org/r/413631 [23:04:00] (03PS3) 10Dzahn: partman: add kafkamon[1-2]00[0-9] [puppet] - 10https://gerrit.wikimedia.org/r/413283 (https://phabricator.wikimedia.org/T187901) [23:05:55] (03PS1) 10Rush: openstack: designate pass monitoring values through deployment [puppet] - 10https://gerrit.wikimedia.org/r/413634 (https://phabricator.wikimedia.org/T178405) [23:06:12] (03PS2) 10Rush: openstack: designate pass monitoring values through deployment [puppet] - 10https://gerrit.wikimedia.org/r/413634 (https://phabricator.wikimedia.org/T178405) [23:10:54] (03PS3) 10Rush: openstack: designate pass monitoring values through deployment [puppet] - 10https://gerrit.wikimedia.org/r/413634 (https://phabricator.wikimedia.org/T178405) [23:14:10] (03CR) 10Rush: [C: 032] openstack: designate pass monitoring values through deployment [puppet] - 10https://gerrit.wikimedia.org/r/413634 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [23:14:58] (03CR) 10Dzahn: [C: 032] partman: add kafkamon[1-2]00[0-9] [puppet] - 10https://gerrit.wikimedia.org/r/413283 
(https://phabricator.wikimedia.org/T187901) (owner: 10Dzahn) [23:15:04] (03PS4) 10Dzahn: partman: add kafkamon[1-2]00[0-9] [puppet] - 10https://gerrit.wikimedia.org/r/413283 (https://phabricator.wikimedia.org/T187901) [23:16:23] (03PS3) 10Madhuvishy: nfs traffic_shaping: Add labstore1006|7 to tc setup [puppet] - 10https://gerrit.wikimedia.org/r/413631 [23:17:39] mutante: I caught 'Dzahn: partman: add kafkamon[1-2]00[0-9] (f5d38bb)' on the master, ok to merge? [23:17:48] chasemp: i caught yours in the same moment :) [23:17:50] yes please [23:17:55] was about to type the same [23:18:04] heh, cool [23:19:10] (03CR) 10Rush: [C: 032] shinken: WMCS: fix Sum of puppet failures service deffinition [puppet] - 10https://gerrit.wikimedia.org/r/413632 (https://phabricator.wikimedia.org/T161898) (owner: 10Chico Venancio) [23:19:15] (03PS2) 10Rush: shinken: WMCS: fix Sum of puppet failures service deffinition [puppet] - 10https://gerrit.wikimedia.org/r/413632 (https://phabricator.wikimedia.org/T161898) (owner: 10Chico Venancio) [23:25:47] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:25:56] (03PS1) 10Rush: openstack: designate pass critical from deployment typo [puppet] - 10https://gerrit.wikimedia.org/r/413636 (https://phabricator.wikimedia.org/T178405) [23:27:02] (03CR) 10Rush: [C: 032] openstack: designate pass critical from deployment typo [puppet] - 10https://gerrit.wikimedia.org/r/413636 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [23:27:05] (03PS2) 10Rush: openstack: designate pass critical from deployment typo [puppet] - 10https://gerrit.wikimedia.org/r/413636 (https://phabricator.wikimedia.org/T178405) [23:27:57] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:30:47] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [23:36:12] (03CR) 10Paladox: "@Alexandros Kosiaris we have been on puppet 4 for a couple months, can we remove ruby-mysql now?" [puppet] - 10https://gerrit.wikimedia.org/r/391336 (owner: 10Paladox) [23:49:27] (03PS6) 10Krinkle: multiversion: Remove support for MW_LANG env override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410110 [23:53:05] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:53:21] (03CR) 10BBlack: [C: 031] Stop routing Varnish thumb.php traffic to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/413185 (https://phabricator.wikimedia.org/T187899) (owner: 10Gilles) [23:55:22] no_justification: If you have a minute to review https://gerrit.wikimedia.org/r/#/c/410110/, could roll out in the next hour. [23:55:39] I've rolled out the extract2 fix a few days ago [23:55:47] So it should have 0 uses now [23:56:04] https://github.com/search?q=org:wikimedia+MW_LANG&ref=opensearch&type=Code [23:57:35] (03CR) 10Krinkle: [C: 031] Stop routing Varnish thumb.php traffic to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/413185 (https://phabricator.wikimedia.org/T187899) (owner: 10Gilles) [23:57:40] (03PS1) 1020after4: scap sync-canary plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413640 [23:58:50] bblack: Is there a tracking task for decom of rendering app servers? 
Couldn't find one, but might be called something else [23:59:03] (03CR) 10Chad: [C: 031] multiversion: Remove support for MW_LANG env override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410110 (owner: 10Krinkle) [23:59:11] Krinkle: I trust you've grepped around and shit [23:59:20] * no_justification has no brain power left to do it right now [23:59:35] no_justification: Aye, in all repos, and I'll also do an extensive test on mwdebug to be sure. [23:59:50] I'll do a grep on tin as well, good point.