[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170323T0000). [00:04:05] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89987.98 seconds [00:06:14] (03CR) 10BryanDavis: [C: 031] l10nupdate: Reduce code duplication in git clone operations [puppet] - 10https://gerrit.wikimedia.org/r/255958 (owner: 10Reedy) [00:08:15] PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:13:05] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3123846 (10leila) @faidon, Aaron pointed that I should have a look at this conversation given that some of the work we may do with TensorFlows in the coming year may... [00:15:08] (03PS1) 10Tim Starling: Deploy ParserMigration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344276 (https://phabricator.wikimedia.org/T141586) [00:18:11] (03CR) 10Dzahn: labtest: convert labtestmetal to labtestvirt2001 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/344193 (owner: 10Rush) [00:20:21] (03Abandoned) 10Dzahn: add netmon1002 to site [puppet] - 10https://gerrit.wikimedia.org/r/333780 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [00:20:23] (03PS2) 10Tim Starling: Deploy ParserMigration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344276 (https://phabricator.wikimedia.org/T141586) [00:21:15] PROBLEM - puppet last run on kafka1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:22:53] (03CR) 10Dzahn: [C: 031] "yep, Ganglia is "officially" deprecated" [puppet] - 10https://gerrit.wikimedia.org/r/337002 (owner: 10Ema) [00:24:39] (03CR) 10Dzahn: "bump. still want it during next maintenance?" 
[puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [00:25:15] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [00:25:35] 06Operations, 10Ops-Access-Requests: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181#3123873 (10MusikAnimal) [00:28:32] 06Operations, 10Ops-Access-Requests: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181#3123873 (10kaldari) As Leon's manager, I approve and endorse this request. [00:35:15] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:36:15] RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [00:37:52] 06Operations, 10Ops-Access-Requests: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181#3123873 (10Dzahn) for other ops: this is about access to maintenance hosts (terbium and wasat) to run MySQL queries against prod dbs from there. that means: role::mediawiki::main... [00:39:33] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3123954 (10faidon) a:03RobH OK, thank you all for the quick response. That sounds like a plan. @RobH, could you work with our vendors to get quotes with the AMD Fir... [00:40:22] 06Operations, 10Ops-Access-Requests: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181#3123959 (10MusikAnimal) >>! In T161181#3123952, @Dzahn wrote: > So this could be "restricted" and that would be enough for right now. ( description: access to terbium, mwlog hosts... [00:42:25] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:42:35] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:43:22] (03PS3) 10Tim Starling: Deploy ParserMigration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344276 (https://phabricator.wikimedia.org/T141586) [00:45:15] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:45:25] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [00:49:57] 06Operations, 10Ops-Access-Requests: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181#3123873 (10bd808) Community Tech is losing a deployer (me) in a week. Adding Foundation staff with interest to the deployers group seems like the sort of thing we should encourage... [00:50:15] RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [00:58:15] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:04:15] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [01:27:15] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [01:27:29] 07Puppet, 06Release-Engineering-Team: Preload TestingAccessWrapper in production mwrepl - https://phabricator.wikimedia.org/T143607#3124033 (10Mattflaschen-WMF) [01:27:56] 06Operations, 10Ops-Access-Requests: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181#3124034 (10Dzahn) >>! In T161181#3123971, @bd808 wrote: > Adding Foundation staff with interest to the deployers group seems like the sort of thing we should encourage rather than... 
[01:30:05] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 57263.140862 Seconds [01:30:05] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 57263.158179 Seconds [01:30:15] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 57269.12127 Seconds [01:30:25] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 58337.798801 Seconds [01:30:26] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 58339.609873 Seconds [01:31:55] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 58421.841331 Seconds [01:33:05] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:35:05] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:35:05] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [01:38:05] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 57743.28421 Seconds [01:38:05] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 57743.308474 Seconds [01:38:31] 06Operations, 10Ops-Access-Requests: Add two Analytics team members to wmf-deployments - https://phabricator.wikimedia.org/T161157#3123255 (10Dzahn) for other ops: this is a gerrit-admin thing, not shell access or LDAP. 
https://gerrit.wikimedia.org/r/#/admin/groups/21,members [01:38:48] 06Operations, 10Ops-Access-Requests, 10Gerrit: Add two Analytics team members to wmf-deployments - https://phabricator.wikimedia.org/T161157#3124038 (10Dzahn) [01:39:55] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [01:41:25] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [01:44:20] 06Operations, 10Ops-Access-Requests, 10Gerrit: Add two Analytics team members to wmf-deployments - https://phabricator.wikimedia.org/T161157#3124056 (10Dzahn) done in gerrit web ui. (per: milimetric has existing shell with deployment access, ottomata has root) [01:44:26] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 59179.634663 Seconds [01:46:20] !log added ottomata and milimetric to "wmf-deployers" in Gerrit web ui, both have existing (deployment resp. root) shell already (T161157) [01:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:27] T161157: Add two Analytics team members to wmf-deployments - https://phabricator.wikimedia.org/T161157 [01:46:55] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 59321.638503 Seconds [01:47:22] 06Operations, 10Ops-Access-Requests, 10Gerrit: Add two Analytics team members to wmf-deployments - https://phabricator.wikimedia.org/T161157#3124076 (10Dzahn) 05Open>03Resolved a:03Dzahn [01:58:05] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:58:05] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [01:58:15] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [01:58:25] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 4.795039 Seconds [01:58:25] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 6.410932 Seconds [01:58:55] RECOVERY - 
Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [01:59:33] (03PS4) 10Tim Starling: Deploy ParserMigration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344276 (https://phabricator.wikimedia.org/T141586) [02:02:05] RECOVERY - puppet last run on ms-be1013 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [02:03:40] (03PS1) 10Dzahn: netmon1001 migration: separate rsync modules for each app [puppet] - 10https://gerrit.wikimedia.org/r/344292 (https://phabricator.wikimedia.org/T125020) [02:05:21] (03PS2) 10Dzahn: netmon1001 migration: separate rsync modules for each app [puppet] - 10https://gerrit.wikimedia.org/r/344292 (https://phabricator.wikimedia.org/T125020) [02:05:23] (03CR) 10jerkins-bot: [V: 04-1] netmon1001 migration: separate rsync modules for each app [puppet] - 10https://gerrit.wikimedia.org/r/344292 (https://phabricator.wikimedia.org/T125020) (owner: 10Dzahn) [02:06:54] (03CR) 10jerkins-bot: [V: 04-1] netmon1001 migration: separate rsync modules for each app [puppet] - 10https://gerrit.wikimedia.org/r/344292 (https://phabricator.wikimedia.org/T125020) (owner: 10Dzahn) [02:07:20] (03PS3) 10Dzahn: netmon1001 migration: separate rsync modules for each app [puppet] - 10https://gerrit.wikimedia.org/r/344292 (https://phabricator.wikimedia.org/T125020) [02:21:12] (03CR) 10Dzahn: [C: 032] netmon1001 migration: separate rsync modules for each app [puppet] - 10https://gerrit.wikimedia.org/r/344292 (https://phabricator.wikimedia.org/T125020) (owner: 10Dzahn) [02:21:19] (03PS4) 10Dzahn: netmon1001 migration: separate rsync modules for each app [puppet] - 10https://gerrit.wikimedia.org/r/344292 (https://phabricator.wikimedia.org/T125020) [02:23:48] heh, that nick change must be based on phase-of-the-moon :) [02:35:08] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.16) (duration: 13m 26s) [02:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:04] 
(03PS1) 10Dzahn: netmon1001 migration: additional paths to backup [puppet] - 10https://gerrit.wikimedia.org/r/344298 (https://phabricator.wikimedia.org/T125020) [03:02:52] (03PS1) 10Dzahn: netmon1001: add host to Bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/344299 (https://phabricator.wikimedia.org/T125020) [03:04:00] (03CR) 10Dzahn: [C: 032] netmon1001 migration: additional paths to backup [puppet] - 10https://gerrit.wikimedia.org/r/344298 (https://phabricator.wikimedia.org/T125020) (owner: 10Dzahn) [03:04:08] (03PS2) 10Dzahn: netmon1001 migration: additional paths to backup [puppet] - 10https://gerrit.wikimedia.org/r/344298 (https://phabricator.wikimedia.org/T125020) [03:07:15] (03CR) 10Dzahn: [C: 032] netmon1001: add host to Bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/344299 (https://phabricator.wikimedia.org/T125020) (owner: 10Dzahn) [03:07:21] (03PS2) 10Dzahn: netmon1001: add host to Bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/344299 (https://phabricator.wikimedia.org/T125020) [03:08:02] (03CR) 10Pnorman: [C: 04-1] "As part of https://phabricator.wikimedia.org/T160781 I found a bug, the replication change files need to be changed as .osc.gz, not .osm.g" [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel) [03:08:18] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.17) (duration: 14m 39s) [03:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:12:52] (03PS1) 10Dzahn: bacula: add file sets for librenms,torrus,smokeping [puppet] - 10https://gerrit.wikimedia.org/r/344300 (https://phabricator.wikimedia.org/T125020) [03:13:05] (03CR) 10jerkins-bot: [V: 04-1] bacula: add file sets for librenms,torrus,smokeping [puppet] - 10https://gerrit.wikimedia.org/r/344300 (https://phabricator.wikimedia.org/T125020) (owner: 10Dzahn) [03:14:06] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Mar 23 03:14:05 UTC 2017 
(duration 5m 47s) [03:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:15:19] (03PS2) 10Dzahn: bacula: add file sets for librenms,torrus,smokeping [puppet] - 10https://gerrit.wikimedia.org/r/344300 (https://phabricator.wikimedia.org/T125020) [03:15:19] (03CR) 10jerkins-bot: [V: 04-1] bacula: add file sets for librenms,torrus,smokeping [puppet] - 10https://gerrit.wikimedia.org/r/344300 (https://phabricator.wikimedia.org/T125020) (owner: 10Dzahn) [03:15:43] (03PS3) 10Dzahn: bacula: add file sets for librenms,torrus,smokeping [puppet] - 10https://gerrit.wikimedia.org/r/344300 (https://phabricator.wikimedia.org/T125020) [03:17:43] (03CR) 10Dzahn: [C: 032] bacula: add file sets for librenms,torrus,smokeping [puppet] - 10https://gerrit.wikimedia.org/r/344300 (https://phabricator.wikimedia.org/T125020) (owner: 10Dzahn) [03:26:15] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 658.74 seconds [03:28:24] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#3124133 (10Dzahn) Needs one more patch to add backup sets to role but my laptop is about to shut down and i could not upload for some reason... [03:29:15] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 164.95 seconds [03:43:25] PROBLEM - puppet last run on wdqs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:10:25] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:12:25] RECOVERY - puppet last run on wdqs1003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [04:15:08] https://wikitech.wikimedia.org/wiki/Logs#mwlog1001:.2Fa.2Fmw-log.2F [04:15:14] /a/mw-log definitely doesn't exist [04:15:17] did it get moved? 
[04:15:53] /srv apparently [04:21:53] (03PS1) 10Reedy: Add skin, language, and variant to user_properties_anon [puppet] - 10https://gerrit.wikimedia.org/r/344302 (https://phabricator.wikimedia.org/T152043) [04:22:35] heh [04:22:53] (03CR) 10jerkins-bot: [V: 04-1] Add skin, language, and variant to user_properties_anon [puppet] - 10https://gerrit.wikimedia.org/r/344302 (https://phabricator.wikimedia.org/T152043) (owner: 10Reedy) [04:23:00] Let me guess, line too long [04:23:54] 79 chars 4 life [04:24:21] (03PS2) 10Reedy: Add skin, language, and variant to user_properties_anon [puppet] - 10https://gerrit.wikimedia.org/r/344302 (https://phabricator.wikimedia.org/T152043) [04:24:26] I think we allow 100 for some dumb reason [04:24:34] we do [04:24:36] this was 117 [04:24:46] 79 man, 79 [04:25:05] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:25:22] that's going to fail for hanging indent :) [04:25:55] nope [04:25:58] oh creepy you used embedded linebreak [04:27:23] thanks Reedy :) [04:28:11] Now, as it's almost 04:30, I should probably sleep [04:30:46] I'm hitting lots of 503's on phab [04:30:50] e.g Request from 2601:646:9300:ef30:aeb5:7dff:fe20:498a via cp2006 cp2006, Varnish XID 183589195
Error: 503, Backend fetch failed at Thu, 23 Mar 2017 04:30:18 GMT [04:32:24] I'll +1 that phab problem. Just had to reload a couple of times and still had ajax fetch failures [04:33:39] (03CR) 10Chad: [C: 031] "Has this been added to make-wmf-branch yet? Doing so will ensure the code gets picked up on the next and subsequent branch cycles. Otherwi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344276 (https://phabricator.wikimedia.org/T141586) (owner: 10Tim Starling) [04:33:42] seems to be working again now... [04:34:54] (03CR) 10BryanDavis: [C: 031] Add skin, language, and variant to user_properties_anon [puppet] - 10https://gerrit.wikimedia.org/r/344302 (https://phabricator.wikimedia.org/T152043) (owner: 10Reedy) [04:38:25] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [04:54:05] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [04:57:20] (03PS1) 10BryanDavis: openstack: Fix password keyword arg in mwopenstackclients [puppet] - 10https://gerrit.wikimedia.org/r/344304 [04:59:15] PROBLEM - puppet last run on db1089 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:16:36] 06Operations, 06Performance-Team: Upgrade to Grafana 4.2.0 - https://phabricator.wikimedia.org/T161193#3124184 (10Peter) [05:28:15] RECOVERY - puppet last run on db1089 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [05:34:25] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 10706.962769 Seconds [05:34:25] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 10708.531535 Seconds [05:34:55] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 10730.878889 Seconds [05:37:25] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [05:37:25] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [05:37:55] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [06:47:03] (03PS1) 10Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344306 (https://phabricator.wikimedia.org/T137191) [06:49:16] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344306 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [06:50:46] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344306 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [06:50:56] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344306 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [06:52:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1082 - T137191 (duration: 00m 44s) [06:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:46] T137191: Defragment db1070, db1082, db1087, db1092 - 
https://phabricator.wikimedia.org/T137191 [07:08:23] !log Stop MySQL db1082 for maintenance - https://phabricator.wikimedia.org/T137191 [07:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:15] PROBLEM - puppet last run on darmstadtium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:14:28] ACKNOWLEDGEMENT - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] Giuseppe Lavagetto T161159 [07:23:25] PROBLEM - puppet last run on ms-be1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:30:48] (03PS1) 10Marostegui: db-eqiad.php: Depool db1059, repool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344308 (https://phabricator.wikimedia.org/T73563) [07:33:30] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1059, repool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344308 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [07:34:59] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1059, repool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344308 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [07:36:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1068, depool db1059 T160415 - T73563 (duration: 00m 43s) [07:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:12] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [07:36:12] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 [07:37:42] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1059, repool db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344308 (https://phabricator.wikimedia.org/T73563) (owner: 
10Marostegui) [07:37:53] !log Deploy schema change s4 on db1068 and labsdb1001 T160415 - T73563 [07:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:15] RECOVERY - puppet last run on darmstadtium is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [07:45:53] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3124257 (10MoritzMuehlenhoff) a:03Cmjohnson [07:51:25] RECOVERY - puppet last run on ms-be1024 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [07:53:21] great: phone appeared to be working but wasn't passing calls or pages [07:53:45] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [07:55:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [08:00:45] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:01:45] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:06:32] !log installing audiofile security updates [08:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:05] PROBLEM - puppet last run on db1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:12:03] (03PS1) 10Urbanecm: Close wikimania2016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344321 (https://phabricator.wikimedia.org/T161183) [08:15:42] (03CR) 10Jcrespo: [C: 04-1] "Those are not mariadb-related files. Move them to a module, but not mariadb's."
[puppet] - 10https://gerrit.wikimedia.org/r/344195 (owner: 10Chad) [08:24:05] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [08:29:36] !log Stop db1070 MySQL db1070 for maintenance - T137191 [08:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:42] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [08:32:15] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:15] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:15] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:15] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:15] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:25] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:25] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:25] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:25] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:27] Probably due to backups ^ [08:32:30] Will silence it [08:32:35] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:45] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:48] i'm getting "Error: 503, Backend fetch failed " on enwiki [08:32:55] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:32:56] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:56] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:56] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:56] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:56] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:56] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:57] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:32:57] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:33:15] (... after trying to save an edit) [08:33:48] (worked ok on second try) [08:34:35] 06Operations, 10Domains, 10Traffic, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3124283 (10Beetlebeard) Thanks alot. G Suite is verified. All should work, as it it supposed to. 
Cheers [08:36:55] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [08:36:55] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:36:55] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:36:55] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:36:55] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:36:56] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [08:36:56] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [08:36:57] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:36:57] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:36:59] this pattern is concerning: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?from=1490228854356&to=1490246296838&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=es1&var-shard=es2&var-shard=es3&var-role=All [08:37:05] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [08:37:05] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:37:05] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:37:05] RECOVERY - puppet last run on db1052 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [08:37:15] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:37:15] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [08:37:15] RECOVERY - MariaDB Slave IO: s3 
on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:37:15] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:37:15] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:37:15] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [08:37:25] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:37:35] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:37:41] are those dumps? [08:37:48] yep [08:37:50] dumops are running [08:37:59] *dumps [08:38:08] <_joe_> at 3 AM either dumps or some cronjob [08:38:14] they are running too fast [08:38:23] that will affect the save and read time [08:38:27] of uncached requests [08:38:37] <_joe_> apergos: ^^ [08:38:39] they are coming from the dump user, so I assume dumps? [08:38:53] or the dump user is used for something else too? [08:39:34] See: https://grafana.wikimedia.org/dashboard/db/save-timing?from=1490228854000&to=1490246854000 [08:40:24] <_joe_> marostegui: do you also have the originating IP? 
[08:40:31] <_joe_> that's gonna make things easier :) [08:40:43] dump is dumps only [08:40:53] otherwise it would be an abuse [08:41:03] It is from localhost, so probably the /etc/bacula/dumps script :) [08:41:24] marostegui, I think you are confusing 2 things [08:41:30] dumps and our backups [08:41:47] we do not care about our backups- those only read on dbstore1001 [08:41:54] oh yes, dumps I was meaning mysqldumps [08:42:11] they can bring down the server and nobody would care [08:42:23] they probably should be slower, but that is our responsibility [08:42:48] dumps on the other side read from production machines, and have an impact on save and read time [08:43:04] the cliff on my first link is too steep [08:43:09] yes, you are right I was meaning dumps as in mysqldump as in backups [08:47:46] !log repooled mw1261 now that T161095 is deployed [08:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:52] T161095: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095 [08:51:23] (03PS1) 10Muehlenhoff: Add retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/344322 [08:55:12] (03CR) 10Muehlenhoff: [C: 032] Add retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/344322 (owner: 10Muehlenhoff) [08:59:54] (03PS1) 10Muehlenhoff: Add another retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/344323 [09:00:54] (03CR) 10Muehlenhoff: [C: 032] Add another retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/344323 (owner: 10Muehlenhoff) [09:11:50] (03PS1) 10Muehlenhoff: Update to 4.4.53 [debs/linux44] - 10https://gerrit.wikimedia.org/r/344325 [09:14:05] !log Jenkins upgrading SSH Slaves plugin. 
Might cause disruption in CI [09:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:10] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.53 [debs/linux44] - 10https://gerrit.wikimedia.org/r/344325 (owner: 10Muehlenhoff) [09:17:35] PROBLEM - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused [09:19:20] (03PS1) 10Muehlenhoff: Update to 4.4.54 [debs/linux44] - 10https://gerrit.wikimedia.org/r/344326 [09:19:35] RECOVERY - jenkins_zmq_publisher on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 8888 [09:20:55] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:21:55] PROBLEM - bacula director process on helium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-dir [09:22:55] RECOVERY - bacula director process on helium is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-dir [09:24:05] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 24523.096438 Seconds [09:24:05] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 24523.099002 Seconds [09:24:15] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 24528.922585 Seconds [09:24:33] "jenkins_zmq_publisher on contint1001 is OK" was me restarting Jenkins [09:24:34] <_joe_> should we do something about those postgres replica alarms? 
[09:26:05] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [09:26:05] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [09:26:05] RECOVERY - Check systemd state on helium is OK: OK - running: The system is fully operational [09:26:06] _joe_: I asked gehel to have a look [09:26:15] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [09:26:29] <_joe_> yeah I want to know if those are alarms I should care for [09:26:50] no, being maps-test [09:26:56] ooops today are maps [09:26:59] not -test [09:27:36] they recovered though [09:27:50] 24k seconds is 1/4 of a day though [09:27:52] weird [09:28:44] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.54 [debs/linux44] - 10https://gerrit.wikimedia.org/r/344326 (owner: 10Muehlenhoff) [09:31:20] (03PS1) 10Muehlenhoff: Update to 4.4.55 [debs/linux44] - 10https://gerrit.wikimedia.org/r/344329 [09:33:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] "nice. minor inline comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344176 (owner: 10Gehel) [09:37:13] volans, _joe_ : no emergency on the pg replication. I had a look at maps-test, but the maps1/2 are probably different. 
[09:38:09] volans, _joe_ : my laptop screen just broke, I'm trying to get it fixed, but won't be able to look into more details for a few hours :( [09:39:04] gehel: :/ read the matrix ;) [09:40:06] I have broken my cool glasses, and left my Nokia phone at home [09:43:37] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.55 [debs/linux44] - 10https://gerrit.wikimedia.org/r/344329 (owner: 10Muehlenhoff) [09:48:23] 06Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3124380 (10jcrespo) [09:48:25] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242#3124379 (10jcrespo) 05Open>03Resolved [09:49:55] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:50:44] 06Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3124383 (10jcrespo) es2015 crashed on 2017-03-11, faulty cpu and board replaced. [09:52:24] sorry, they just dropped off a pallete (wooden shipping container) in front of my house, no tools to open it so I've been dealing with that and shlepping the boxes upstairs [09:52:45] that's all the rest of my belongings now [09:54:16] dumps running from my end should be doing not much, quick requests and not too many, this is no different than any other month, and it's not even the history dumps... [09:54:30] just fetching the occasional revision that the previous dumps don't have [09:54:52] it might be some other job that runs as the wikiadmin user (wikidata cron or other?) 
[09:56:35] (03PS1) 10Jcrespo: Pool all es2XXX servers, depool es2014 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344332 (https://phabricator.wikimedia.org/T129350) [09:58:07] !log Jenkins: upgrading plugins email-ext and mailer [09:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:03] it would be nice to know what the queries were, to make sure [10:03:08] (03PS1) 10Muehlenhoff: Update to 4.4.56 [debs/linux44] - 10https://gerrit.wikimedia.org/r/344335 [10:04:06] (03PS1) 10Marostegui: netboot.cfg: Format db1082 [puppet] - 10https://gerrit.wikimedia.org/r/344337 (https://phabricator.wikimedia.org/T137191) [10:06:29] (03PS2) 10Jcrespo: mariadb: Pool all es2XXX servers, depool es2014 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344332 (https://phabricator.wikimedia.org/T129350) [10:08:04] (03CR) 10Marostegui: [C: 031] mariadb: Pool all es2XXX servers, depool es2014 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344332 (https://phabricator.wikimedia.org/T129350) (owner: 10Jcrespo) [10:12:15] [10:12:16] [10:12:16] [10:15:08] (03PS1) 10Filippo Giunchedi: prometheus: keep more chunks in memory [puppet] - 10https://gerrit.wikimedia.org/r/344340 (https://phabricator.wikimedia.org/T148408) [10:15:58] (03PS1) 10Muehlenhoff: Enable experimental section on all canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/344341 [10:20:18] (03CR) 10Jcrespo: [C: 031] netboot.cfg: Format db1082 [puppet] - 10https://gerrit.wikimedia.org/r/344337 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [10:20:36] !log Jenkins jobs got slightly blocked because I forgot to cancel the shutdown when jobs had to run. 
[10:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:05] (03CR) 10Marostegui: [C: 032] netboot.cfg: Format db1082 [puppet] - 10https://gerrit.wikimedia.org/r/344337 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [10:28:17] (03CR) 10Filippo Giunchedi: [C: 032] "untestable with PCC due to puppetdb usage" [puppet] - 10https://gerrit.wikimedia.org/r/344340 (https://phabricator.wikimedia.org/T148408) (owner: 10Filippo Giunchedi) [10:28:20] (03PS1) 10Muehlenhoff: Enable systemd-timesyncd for all Debian systems [puppet] - 10https://gerrit.wikimedia.org/r/344342 (https://phabricator.wikimedia.org/T150257) [10:28:23] (03PS2) 10Filippo Giunchedi: prometheus: keep more chunks in memory [puppet] - 10https://gerrit.wikimedia.org/r/344340 (https://phabricator.wikimedia.org/T148408) [10:28:43] (03CR) 10Jcrespo: [C: 032] mariadb: Pool all es2XXX servers, depool es2014 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344332 (https://phabricator.wikimedia.org/T129350) (owner: 10Jcrespo) [10:30:32] (03Merged) 10jenkins-bot: mariadb: Pool all es2XXX servers, depool es2014 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344332 (https://phabricator.wikimedia.org/T129350) (owner: 10Jcrespo) [10:30:43] (03CR) 10jenkins-bot: mariadb: Pool all es2XXX servers, depool es2014 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344332 (https://phabricator.wikimedia.org/T129350) (owner: 10Jcrespo) [10:32:05] 06Operations, 06WMF-Legal, 10Wikimedia-General-or-Unknown, 07Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#3124512 (10valhallasw) I hereby license all my existing contributions to the operations/puppet under the Apache 2.0 license = https://ge... 
[10:32:44] (03Abandoned) 10Ema: tlsproxy: enable Lua support by default [puppet] - 10https://gerrit.wikimedia.org/r/343827 (owner: 10Ema) [10:33:19] (03CR) 10Merlijn van Deen: [C: 031] "As noted on T67270, I license all my existing contributions to the operations/puppet repo under the Apache 2.0 license." [puppet] - 10https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) (owner: 10Rush) [10:37:25] (03CR) 10Muehlenhoff: "PCC: http://puppet-compiler.wmflabs.org/5871/" [puppet] - 10https://gerrit.wikimedia.org/r/344342 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [10:37:52] (03PS1) 10Alexandros Kosiaris: osm: Specify the prometheus_path variable correctly [puppet] - 10https://gerrit.wikimedia.org/r/344343 [10:39:29] (03PS2) 10Alexandros Kosiaris: osm: Specify the prometheus_path variable correctly [puppet] - 10https://gerrit.wikimedia.org/r/344343 [10:39:35] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] osm: Specify the prometheus_path variable correctly [puppet] - 10https://gerrit.wikimedia.org/r/344343 (owner: 10Alexandros Kosiaris) [10:43:55] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3124517 (10akosiaris) 05Open>03Resolved a:03akosiaris It seems we may have resolved this. 
I 'll be resolving the task, we can always reopen [10:54:18] (03PS1) 10Alexandros Kosiaris: Add an $os parameter to monitoring::host [puppet] - 10https://gerrit.wikimedia.org/r/344347 [10:54:20] (03PS1) 10Alexandros Kosiaris: Add an os parameter to netops::check [puppet] - 10https://gerrit.wikimedia.org/r/344348 [10:54:22] (03PS1) 10Alexandros Kosiaris: Pass Junos as the os for all networking devices [puppet] - 10https://gerrit.wikimedia.org/r/344349 [10:59:32] !log Actually restarting Jenkins for email plugins upgrades [10:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:21] 06Operations, 06Performance-Team, 10Thumbor: Thumbor inexplicably 504s intermittently on files that render fine later - https://phabricator.wikimedia.org/T150746#3124548 (10Gilles) It seems like Mediawiki 200ing while Thumbor 504s has about halved: ``` gilles@ms-fe1005:/var/log/swift$ cat server.log.1 | gre... [11:01:44] !log jynus@tin Synchronized wmf-config/db-codfw.php: Pool all es2XXX servers, depool es2014 for maintenance (duration: 00m 43s) [11:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:51] !log pool prometheus200[34] / depool prometheus200[12] - T148408 [11:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:58] T148408: Put prometheus baremetal servers in service - https://phabricator.wikimedia.org/T148408 [11:08:07] !log codfw-prod: bump ms-be2028 ms-be2039 object weight to 2000 T158337 [11:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:14] T158337: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337 [11:11:52] (03PS6) 10Giuseppe Lavagetto: role::jobqueue_redis::master: depend on etcd, not mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344169 [11:13:20] (03PS3) 10Alexandros Kosiaris: sync_icinga_state: stop/start the service [puppet] - 10https://gerrit.wikimedia.org/r/343889 [11:13:27] 
(03CR) 10Alexandros Kosiaris: [V: 032 C: 032] sync_icinga_state: stop/start the service [puppet] - 10https://gerrit.wikimedia.org/r/343889 (owner: 10Alexandros Kosiaris) [11:13:36] (03PS2) 10Alexandros Kosiaris: Add an $os parameter to monitoring::host [puppet] - 10https://gerrit.wikimedia.org/r/344347 [11:13:43] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add an $os parameter to monitoring::host [puppet] - 10https://gerrit.wikimedia.org/r/344347 (owner: 10Alexandros Kosiaris) [11:17:19] (03PS2) 10Alexandros Kosiaris: Add an os parameter to netops::check [puppet] - 10https://gerrit.wikimedia.org/r/344348 [11:17:21] (03PS2) 10Alexandros Kosiaris: Pass Junos as the os for all networking devices [puppet] - 10https://gerrit.wikimedia.org/r/344349 [11:21:20] (03PS3) 10Alexandros Kosiaris: Add an os parameter to netops::check [puppet] - 10https://gerrit.wikimedia.org/r/344348 [11:21:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add an os parameter to netops::check [puppet] - 10https://gerrit.wikimedia.org/r/344348 (owner: 10Alexandros Kosiaris) [11:21:39] (03PS3) 10Alexandros Kosiaris: Pass Junos as the os for all networking devices [puppet] - 10https://gerrit.wikimedia.org/r/344349 [11:21:46] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Pass Junos as the os for all networking devices [puppet] - 10https://gerrit.wikimedia.org/r/344349 (owner: 10Alexandros Kosiaris) [11:22:23] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.56 [debs/linux44] - 10https://gerrit.wikimedia.org/r/344335 (owner: 10Muehlenhoff) [11:24:03] (03Abandoned) 10Alexandros Kosiaris: git-review: Disabling rebasing by default [puppet] - 10https://gerrit.wikimedia.org/r/289672 (owner: 10Alexandros Kosiaris) [11:24:11] (03PS2) 10Muehlenhoff: Enable experimental section on all canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/344341 [11:34:19] (03CR) 10Muehlenhoff: [C: 032] Enable experimental section on all canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/344341 (owner: 
10Muehlenhoff) [11:36:12] PROBLEM - swift codfw-prod object availability on graphite1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [90.0] [11:36:24] expected ^ [11:38:13] (03PS7) 10Giuseppe Lavagetto: role::jobqueue_redis::master: depend on etcd, not mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344169 [11:38:48] <_joe_> well it's good it fires, then :) [11:39:30] <_joe_> I am going to merge that patch ^^, will set downtime for the host in icinga as I fully expect to have screwed something up [11:40:12] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:40:50] ACKNOWLEDGEMENT - swift codfw-prod object availability on graphite1001 is CRITICAL: CRITICAL: 18.18% of data under the critical threshold [90.0] Filippo Giunchedi new ms-be machines [11:41:16] (03CR) 10Filippo Giunchedi: [C: 031] Enable systemd-timesyncd for all Debian systems [puppet] - 10https://gerrit.wikimedia.org/r/344342 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [11:45:26] (03CR) 10Giuseppe Lavagetto: [C: 032] "PCC: https://puppet-compiler.wmflabs.org/5875/" [puppet] - 10https://gerrit.wikimedia.org/r/344169 (owner: 10Giuseppe Lavagetto) [11:45:33] (03PS8) 10Giuseppe Lavagetto: role::jobqueue_redis::master: depend on etcd, not mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344169 [11:47:14] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::jobqueue_redis::master: depend on etcd, not mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344169 (owner: 10Giuseppe Lavagetto) [11:47:42] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#3124693 (10fgiunchedi) 05Open>03Resolved Finishing the setup in {T148408} [11:50:28] (03PS1) 10Giuseppe Lavagetto: profile::jobqueue_redis::master: include redis::passwords [puppet] - 10https://gerrit.wikimedia.org/r/344359 [11:50:50] 
(03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::jobqueue_redis::master: include redis::passwords [puppet] - 10https://gerrit.wikimedia.org/r/344359 (owner: 10Giuseppe Lavagetto) [11:54:56] 06Operations, 06Performance-Team, 10Thumbor: Thumbor inexplicably 504s intermittently on files that render fine later - https://phabricator.wikimedia.org/T150746#3124719 (10Gilles) Actually, nginx was already at 90s, but I think we can double that and see the effect, because it might not be enough as-is to a... [11:55:54] (03PS1) 10Gilles: Increase nginx timeout on Thumbor machines [puppet] - 10https://gerrit.wikimedia.org/r/344360 (https://phabricator.wikimedia.org/T150746) [12:02:36] PROBLEM - Redis replication status tcp_6381 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 617 600 - REDIS 2.8.17 on 10.192.0.120:6381 has 1 databases (db0) with 3776951 keys, up 2 days 18 hours - replication_delay is 617 [12:03:06] PROBLEM - Redis replication status tcp_6378 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 629 600 - REDIS 2.8.17 on 10.192.0.120:6378 has 1 databases (db0) with 14 keys, up 2 days 18 hours - replication_delay is 629 [12:03:06] PROBLEM - Redis replication status tcp_6380 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 647 600 - REDIS 2.8.17 on 10.192.0.120:6380 has 1 databases (db0) with 3781816 keys, up 2 days 18 hours - replication_delay is 647 [12:03:06] PROBLEM - Redis replication status tcp_6379 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 646 600 - REDIS 2.8.17 on 10.192.0.120:6379 has 1 databases (db0) with 8479897 keys, up 2 days 18 hours - replication_delay is 646 [12:03:38] (03PS1) 10Gilles: Increase Thumbor original file size limit to 4GB [puppet] - 10https://gerrit.wikimedia.org/r/344361 (https://phabricator.wikimedia.org/T151456) [12:05:00] !log converting es2014 tables back to uncompressed InnoDB T129350 [12:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:07] T129350: es2014 
revert data compression - https://phabricator.wikimedia.org/T129350 [12:06:12] es2014 may lag ^see log- it has been downtimed and depooled [12:07:16] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [12:08:06] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3751378 keys, up 2 days 18 hours - replication_delay is 37 [12:08:16] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [12:10:06] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 613 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3781515 keys, up 2 days 17 hours - replication_delay is 613 [12:13:06] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3751704 keys, up 2 days 17 hours - replication_delay is 0 [12:14:42] (03PS1) 10Marostegui: labs.my.cnf: Set slave_type_conversions [puppet] - 10https://gerrit.wikimedia.org/r/344362 (https://phabricator.wikimedia.org/T73563) [12:15:43] (03PS2) 10Marostegui: labs.my.cnf: Set slave_type_conversions [puppet] - 10https://gerrit.wikimedia.org/r/344362 (https://phabricator.wikimedia.org/T73563) [12:19:29] (03CR) 10Jcrespo: [C: 031] labs.my.cnf: Set slave_type_conversions [puppet] - 10https://gerrit.wikimedia.org/r/344362 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [12:20:25] (03CR) 10Marostegui: [C: 032] labs.my.cnf: Set slave_type_conversions [puppet] - 10https://gerrit.wikimedia.org/r/344362 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [12:20:33] (03CR) 10Marostegui: [C: 032] "https://puppet-compiler.wmflabs.org/5876/ looks good" [puppet] - 10https://gerrit.wikimedia.org/r/344362 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) 
[12:20:50] <_joe_> rdb2002 is me [12:20:58] <_joe_> downtiming that too [12:21:29] !log Deploy schema change s4 on labsdb1003 https://phabricator.wikimedia.org/T160415 - https://phabricator.wikimedia.org/T73563 [12:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:31] !log installing libxml2 security updates [12:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:00] (03PS1) 10Marostegui: db-eqiad.php: Depool db1064, repool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344365 (https://phabricator.wikimedia.org/T73563) [12:38:35] (03PS1) 10Giuseppe Lavagetto: profile::jobqueue_redis::instances: fix uid/gid for confd::file [puppet] - 10https://gerrit.wikimedia.org/r/344366 [12:41:33] (03PS1) 10Volans: Refactor check for mediawiki config [switchdc] - 10https://gerrit.wikimedia.org/r/344367 (https://phabricator.wikimedia.org/T160178) [12:41:35] (03PS1) 10Volans: Allow to execute integration tests from tox [switchdc] - 10https://gerrit.wikimedia.org/r/344368 (https://phabricator.wikimedia.org/T160178) [12:41:36] PROBLEM - parsoid on wtp2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:41:37] (03PS1) 10Volans: Make all custom exceptions inherit from a common one [switchdc] - 10https://gerrit.wikimedia.org/r/344369 (https://phabricator.wikimedia.org/T160178) [12:41:39] (03PS1) 10Volans: Moved all libraries in a single place [switchdc] - 10https://gerrit.wikimedia.org/r/344370 (https://phabricator.wikimedia.org/T160178) [12:41:41] (03PS1) 10Volans: Fix import error [switchdc] - 10https://gerrit.wikimedia.org/r/344371 (https://phabricator.wikimedia.org/T160178) [12:41:43] (03PS1) 10Volans: Uniform some docstrings and log formatting [switchdc] - 10https://gerrit.wikimedia.org/r/344372 (https://phabricator.wikimedia.org/T160178) [12:41:52] 06Operations, 10DBA, 05DC-Switchover-Prep-Q3-2016-17, 07Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication 
from $::mw_primary - https://phabricator.wikimedia.org/T161007#3124830 (10jcrespo) a:03jcrespo [12:43:26] RECOVERY - parsoid on wtp2015 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.102 second response time [12:47:09] (03PS2) 10Giuseppe Lavagetto: profile::jobqueue_redis::instances: fix uid/gid for confd::file [puppet] - 10https://gerrit.wikimedia.org/r/344366 [12:54:59] (03PS3) 10Giuseppe Lavagetto: profile::jobqueue_redis::instances: fix uid/gid for confd::file [puppet] - 10https://gerrit.wikimedia.org/r/344366 [12:55:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1064, repool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344365 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [12:56:06] RECOVERY - Redis replication status tcp_6379 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6379 has 1 databases (db0) with 8451999 keys, up 2 days 19 hours - replication_delay is 0 [12:57:09] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1064, repool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344365 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [12:57:18] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1064, repool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344365 (https://phabricator.wikimedia.org/T73563) (owner: 10Marostegui) [12:57:51] (03PS4) 10Giuseppe Lavagetto: profile::jobqueue_redis::instances: fix uid/gid for confd::file [puppet] - 10https://gerrit.wikimedia.org/r/344366 [12:58:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1059, depool db1064 T160415 - T73563 (duration: 00m 43s) [12:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:32] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [12:58:32] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - 
https://phabricator.wikimedia.org/T73563 [12:59:43] !log Deploy schema change s4 on db1064 https://phabricator.wikimedia.org/T160415 - https://phabricator.wikimedia.org/T73563 [12:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170323T1300). Please do the needful. [13:01:20] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::jobqueue_redis::instances: fix uid/gid for confd::file [puppet] - 10https://gerrit.wikimedia.org/r/344366 (owner: 10Giuseppe Lavagetto) [13:04:06] RECOVERY - Redis replication status tcp_6378 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6378 has 1 databases (db0) with 15 keys, up 2 days 19 hours - replication_delay is 0 [13:06:06] RECOVERY - Redis replication status tcp_6380 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6380 has 1 databases (db0) with 3753607 keys, up 2 days 19 hours - replication_delay is 0 [13:06:36] RECOVERY - Redis replication status tcp_6381 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6381 has 1 databases (db0) with 3748609 keys, up 2 days 19 hours - replication_delay is 0 [13:09:28] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3124870 (10Ottomata) Thanks all! 
:) [13:13:32] (03PS1) 10Giuseppe Lavagetto: redis::monitoring::nrpe: add package libredis-perl [puppet] - 10https://gerrit.wikimedia.org/r/344374 [13:14:12] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] redis::monitoring::nrpe: add package libredis-perl [puppet] - 10https://gerrit.wikimedia.org/r/344374 (owner: 10Giuseppe Lavagetto) [13:14:46] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 504 (expecting: 200) [13:15:46] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [13:20:17] (03PS1) 10Giuseppe Lavagetto: redis::monitoring::nrpe_instance: fix resource name [puppet] - 10https://gerrit.wikimedia.org/r/344376 [13:20:58] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] redis::monitoring::nrpe_instance: fix resource name [puppet] - 10https://gerrit.wikimedia.org/r/344376 (owner: 10Giuseppe Lavagetto) [13:22:06] (03PS1) 10Muehlenhoff: Extend list of privileged Phabricator projects for offboarding tool [puppet] - 10https://gerrit.wikimedia.org/r/344377 [13:24:56] (03CR) 10Muehlenhoff: [C: 032] Extend list of privileged Phabricator projects for offboarding tool [puppet] - 10https://gerrit.wikimedia.org/r/344377 (owner: 10Muehlenhoff) [13:25:01] (03PS2) 10Muehlenhoff: Extend list of privileged Phabricator projects for offboarding tool [puppet] - 10https://gerrit.wikimedia.org/r/344377 [13:26:42] (03CR) 10Muehlenhoff: [V: 032 C: 032] Extend list of privileged Phabricator projects for offboarding tool [puppet] - 10https://gerrit.wikimedia.org/r/344377 (owner: 10Muehlenhoff) [13:29:29] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#3124908 (10Joe) @elukey looking at the 
numbers, the only slightly... [13:29:32] 06Operations, 10Ops-Access-Requests, 10Gerrit: archiva-deploy password for Chad H. - https://phabricator.wikimedia.org/T161067#3120509 (10MoritzMuehlenhoff) Yeah, but that requires the setup of a second pwstore repo, since the current one for ops is on a restricted host. [13:36:36] PROBLEM - ores on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:22] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3124953 (10MoritzMuehlenhoff) I've sent a mail to the WMF Legal department and added you to CC. When the NDA is on file, we can proceed with th... [13:38:45] PROBLEM - parsoid on wtp2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:50] (03PS1) 10Alexandros Kosiaris: Revert "service::uwsgi: Remove the log-route directives" [puppet] - 10https://gerrit.wikimedia.org/r/344378 [13:39:06] (03PS1) 10Jcrespo: Revert "netboot.cfg: Format db1082" [puppet] - 10https://gerrit.wikimedia.org/r/344379 [13:39:12] (03PS2) 10Jcrespo: Revert "netboot.cfg: Format db1082" [puppet] - 10https://gerrit.wikimedia.org/r/344379 [13:39:19] 06Operations, 10Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3124954 (10MoritzMuehlenhoff) a:05GoranSMilovanovic>03None Also, removing you as the assignee, the actual change will be handled by someone... [13:39:35] RECOVERY - parsoid on wtp2020 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.120 second response time [13:40:19] (03CR) 10Marostegui: [C: 031] "I was going to do it a bit later, as the server was reimaged fine. But thanks for keeping it on your radar too!" 
[puppet] - 10https://gerrit.wikimedia.org/r/344379 (owner: 10Jcrespo) [13:42:20] (03PS2) 10Muehlenhoff: Enable systemd-timesyncd for all Debian systems [puppet] - 10https://gerrit.wikimedia.org/r/344342 (https://phabricator.wikimedia.org/T150257) [13:46:55] (03CR) 10Muehlenhoff: [C: 032] Enable systemd-timesyncd for all Debian systems [puppet] - 10https://gerrit.wikimedia.org/r/344342 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [13:47:12] (03PS1) 10Alexandros Kosiaris: service::uwsgi: name the local logger [puppet] - 10https://gerrit.wikimedia.org/r/344381 (https://phabricator.wikimedia.org/T149010) [13:48:29] (03PS1) 10Giuseppe Lavagetto: profile::jobqueue_redis::instances: add explicit dependency [puppet] - 10https://gerrit.wikimedia.org/r/344382 [13:48:58] (03PS2) 10Giuseppe Lavagetto: profile::jobqueue_redis::instances: add explicit dependency [puppet] - 10https://gerrit.wikimedia.org/r/344382 [13:49:07] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::jobqueue_redis::instances: add explicit dependency [puppet] - 10https://gerrit.wikimedia.org/r/344382 (owner: 10Giuseppe Lavagetto) [13:49:33] (03PS2) 10Alexandros Kosiaris: service::uwsgi: name the local logger [puppet] - 10https://gerrit.wikimedia.org/r/344381 (https://phabricator.wikimedia.org/T149010) [13:51:28] _joe_: 2nd time I rebase, 3 and you are out :P [13:51:32] (03PS3) 10Alexandros Kosiaris: service::uwsgi: name the local logger [puppet] - 10https://gerrit.wikimedia.org/r/344381 (https://phabricator.wikimedia.org/T149010) [13:51:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] service::uwsgi: name the local logger [puppet] - 10https://gerrit.wikimedia.org/r/344381 (https://phabricator.wikimedia.org/T149010) (owner: 10Alexandros Kosiaris) [13:52:03] (03PS2) 10Alexandros Kosiaris: Revert "service::uwsgi: Remove the log-route directives" [puppet] - 10https://gerrit.wikimedia.org/r/344378 [13:52:09] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "service::uwsgi: Remove 
the log-route directives" [puppet] - 10https://gerrit.wikimedia.org/r/344378 (owner: 10Alexandros Kosiaris) [13:52:35] lol [13:52:47] * volans add himself to the merging queue [13:53:20] xddd [13:53:32] !log cirrus: refreshing comp suggest indices in elastic@eqiad to measure times [13:53:39] (03PS2) 10Volans: Failoid: add service to reject connections [puppet] - 10https://gerrit.wikimedia.org/r/343917 (https://phabricator.wikimedia.org/T160994) [13:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:35] RECOVERY - ores on scb1001 is OK: HTTP OK: HTTP/1.0 200 OK - 3345 bytes in 0.012 second response time [13:55:13] (03PS1) 10Giuseppe Lavagetto: site: switch the other codfw masters to the new code [puppet] - 10https://gerrit.wikimedia.org/r/344384 [13:56:00] <_joe_> akosiaris: lol I was annoyed having to rebase and about to blame you :P [13:56:04] (03PS2) 10Giuseppe Lavagetto: site: switch the other codfw masters to the new code [puppet] - 10https://gerrit.wikimedia.org/r/344384 [13:56:11] (03PS1) 10Marostegui: db-eqiad.php: Repool db1082 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344385 (https://phabricator.wikimedia.org/T137191) [13:56:18] (03CR) 10Volans: [C: 032] Failoid: add service to reject connections [puppet] - 10https://gerrit.wikimedia.org/r/343917 (https://phabricator.wikimedia.org/T160994) (owner: 10Volans) [13:56:36] (03PS1) 10Muehlenhoff: Remove obsolete Hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/344386 [13:56:42] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] site: switch the other codfw masters to the new code [puppet] - 10https://gerrit.wikimedia.org/r/344384 (owner: 10Giuseppe Lavagetto) [13:56:49] (03PS3) 10Giuseppe Lavagetto: site: switch the other codfw masters to the new code [puppet] - 10https://gerrit.wikimedia.org/r/344384 [13:58:04] (03PS1) 10Hashar: swift: lower replication interval for beta [puppet] - 10https://gerrit.wikimedia.org/r/344387 
(https://phabricator.wikimedia.org/T160990) [13:58:51] (03PS4) 10Giuseppe Lavagetto: site: switch the other codfw masters to the new code [puppet] - 10https://gerrit.wikimedia.org/r/344384 [13:59:38] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] site: switch the other codfw masters to the new code [puppet] - 10https://gerrit.wikimedia.org/r/344384 (owner: 10Giuseppe Lavagetto) [14:00:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1082 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344385 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [14:02:06] (03Abandoned) 10Ottomata: Release 2.4.0-1 [debs/python-pykafka] (debian) - 10https://gerrit.wikimedia.org/r/292478 (owner: 10Ottomata) [14:02:22] (03Abandoned) 10Ottomata: Also start/stop/restart keyholder-proxy [puppet] - 10https://gerrit.wikimedia.org/r/259596 (owner: 10Ottomata) [14:02:33] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1082 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344385 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [14:02:43] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1082 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344385 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [14:03:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 with low weight - T137191 (duration: 00m 48s) [14:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:50] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [14:04:19] 06Operations, 13Patch-For-Review, 07discovery-system: Create the failoid service as fallback for the DNS discovery - https://phabricator.wikimedia.org/T160994#3125083 (10Volans) [14:05:49] PROBLEM - puppet last run on db1061 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:05:59] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 13Patch-For-Review, and 6 others: DNS: dynamically generate entries for service discovery - https://phabricator.wikimedia.org/T156100#3125086 (10Volans) [14:06:01] 06Operations, 13Patch-For-Review, 07discovery-system: Create the failoid service as fallback for the DNS discovery - https://phabricator.wikimedia.org/T160994#3117766 (10Volans) 05Open>03Resolved Service up and running on `roentgenium` and `tureis` with puppet role `failoid`, refusing connections to port... [14:06:34] 06Operations, 10MediaWiki-Internationalization, 07HHVM, 05MW-1.29-release (WMF-deploy-2017-03-14_(1.29.0-wmf.16)), and 3 others: Uninitialized string offset warnings with HHVM 3.18 in LanguageAz.php and LanguageKk.php - https://phabricator.wikimedia.org/T161095#3125092 (10MoritzMuehlenhoff) The ucfirst err... [14:08:28] (03Abandoned) 10Alexandros Kosiaris: service: add log-route in uwsgi config [puppet] - 10https://gerrit.wikimedia.org/r/343772 (https://phabricator.wikimedia.org/T149010) (owner: 10Ladsgroup) [14:09:50] PROBLEM - parsoid on wtp2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:11:09] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1490278260 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3752731 keys, up 4 minutes 20 seconds - replication_delay is 1490278260 [14:11:39] RECOVERY - parsoid on wtp2009 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.791 second response time [14:14:03] _joe_: ^^^ [14:14:18] (03PS1) 10Giuseppe Lavagetto: site: convert rdb1007 to the new role [puppet] - 10https://gerrit.wikimedia.org/r/344392 [14:14:52] <_joe_> volans: I think it's the usual problem again [14:15:35] <_joe_> akosiaris: I restarted replication on 2005 and it seems we do have some problems [14:15:45] _joe_: possibly, if it's unrelated to you. 
just when I 've resolved the task [14:16:04] maybe it's still catching up ? [14:16:13] ah no [14:16:23] it's already 5 mins since the PROBLEM report [14:16:27] <_joe_> [32129] 23 Mar 14:13:48.486 # I/O error trying to sync with MASTER: connection lost [14:16:38] yeah that's the same issue [14:17:04] maybe we should increase the soft limit as well [14:17:29] Client id=327817604 addr=10.192.32.133:51794 fd=68 name= age=189 idle=189 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=12604 oll=13529 omem=336745112 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits. [14:17:36] omem=336745112 [14:17:49] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 660 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3760066 keys, up 2 days 19 hours - replication_delay is 660 [14:18:03] yeah it's 320 mb [14:18:35] <_joe_> akosiaris: can you look at that? :) [14:18:37] how on earth did it manage to have a client output buffer of 320 mb for more than 60secs [14:18:46] <_joe_> I have my hands full [14:18:46] yeah doing so now [14:19:08] <_joe_> akosiaris: I think I know what would fix that, but that's for a few weeks from now, maybe [14:19:43] <_joe_> akosiaris: keep in mind I'm about to restart all redises on rdb1007 :P [14:20:03] (03CR) 10Giuseppe Lavagetto: [C: 032] site: convert rdb1007 to the new role [puppet] - 10https://gerrit.wikimedia.org/r/344392 (owner: 10Giuseppe Lavagetto) [14:20:56] _joe_: if you are messing with rbd1007 I can't really take a look [14:21:02] I 'll wait for you to finish [14:21:12] better if we don't step on each other's feet [14:21:59] PROBLEM - NTP on db2012 is CRITICAL: NTP CRITICAL: No response from NTP server [14:22:00] <_joe_> akosiaris: heh I'm almost done [14:22:20] (03CR) 10Ottomata: [C: 032] Introduce linters using rake [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/338384 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) 
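A note on the output-buffer exchange above: the quoted Redis log line shows a replica client at omem=336745112 bytes (about 321 MiB) being "scheduled to be closed ASAP for overcoming of output buffer limits", and the chat asks how it stayed that large "for more than 60secs". A minimal sketch of how Redis's `client-output-buffer-limit <class> <hard> <soft> <soft-seconds>` rule is generally documented to behave, assuming a hypothetical hard limit of 512 MiB and soft limit of 256 MiB (the log does not state the configured values; only the 60 seconds and the omem figure come from the chat). The class and function names are illustrative, not Redis internals.

```python
from dataclasses import dataclass
from typing import Optional

MIB = 1024 * 1024


@dataclass
class OutputBufferLimit:
    hard_bytes: int    # exceed this once and the client is dropped (0 disables)
    soft_bytes: int    # must stay above this continuously ... (0 disables)
    soft_seconds: int  # ... for longer than this many seconds


class ReplicaLink:
    """Toy model of one replica connection's output-buffer accounting."""

    def __init__(self, limit: OutputBufferLimit) -> None:
        self.limit = limit
        self.soft_since: Optional[float] = None  # when the soft limit was first crossed

    def observe(self, omem: int, now: float) -> bool:
        """Return True if the client would be scheduled for closing."""
        if self.limit.hard_bytes and omem >= self.limit.hard_bytes:
            return True
        if self.limit.soft_bytes and omem >= self.limit.soft_bytes:
            if self.soft_since is None:
                self.soft_since = now  # start the soft-limit timer
            elif now - self.soft_since > self.limit.soft_seconds:
                return True
        else:
            self.soft_since = None  # dropped below the soft limit: timer resets
        return False


# Hypothetical limits; only omem and the 60 s window appear in the log.
limit = OutputBufferLimit(hard_bytes=512 * MIB, soft_bytes=256 * MIB, soft_seconds=60)
link = ReplicaLink(limit)
OMEM_FROM_LOG = 336745112

closed_at_start = link.observe(OMEM_FROM_LOG, now=0)    # timer starts, not closed
closed_at_60s = link.observe(OMEM_FROM_LOG, now=60)     # still within the window
closed_after_60s = link.observe(OMEM_FROM_LOG, now=61)  # window exceeded
```

Under these assumed values a buffer sitting at 321 MiB never trips the hard limit, so the disconnect can only come from the soft limit, which fits the later suggestion in the chat to raise the soft limit as well.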
[14:22:27] <_joe_> akosiaris: done! [14:22:28] (03CR) 10Ottomata: [C: 032] "Sure, nobody uses this repo anymore though! :)" [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/338384 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:22:33] (03CR) 10Ottomata: [V: 032 C: 032] Introduce linters using rake [puppet/cdh4] - 10https://gerrit.wikimedia.org/r/338384 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:22:44] ottomata: -:} [14:22:49] (03CR) 10Ottomata: [C: 032] Introduce linters using rake [puppet/kafka] - 10https://gerrit.wikimedia.org/r/338385 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:22:54] !log installing exim4 updates from jessie point release [14:22:57] (03CR) 10Ottomata: [C: 032] "We don't use this repo anymore." [puppet/kafka] - 10https://gerrit.wikimedia.org/r/338385 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:22:59] (03CR) 10Ottomata: [V: 032 C: 032] Introduce linters using rake [puppet/kafka] - 10https://gerrit.wikimedia.org/r/338385 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:52] (03CR) 10Ottomata: "I have a feeling this was fixed elsewhere. Was it? If so, can we abandon this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/319071 (owner: 10Muehlenhoff) [14:25:43] (03CR) 10Muehlenhoff: "No, that's still an issue and needs review :-)" [puppet] - 10https://gerrit.wikimedia.org/r/319071 (owner: 10Muehlenhoff) [14:28:43] PROBLEM - parsoid on wtp2013 is CRITICAL: HTTP CRITICAL - No data received from host [14:29:22] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 614656 [14:29:48] <_joe_> what' [14:29:52] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1490279390 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3752731 keys, up 23 minutes 10 seconds - replication_delay is 1490279390 [14:29:54] <_joe_> s up with parsoid in codfw? [14:30:07] <_joe_> cp is killing it? [14:30:12] PROBLEM - Check the NTP synchronisation status of timesyncd on db1061 is CRITICAL: NRPE: Command check_timesynd_ntp_status not defined [14:30:43] RECOVERY - parsoid on wtp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.124 second response time [14:30:45] <_joe_> parsoid/47195 on wtp2013: Heap memory limit temporarily exceeded [14:30:51] <_joe_> heh that's what happened [14:31:10] (03PS2) 10Andrew Bogott: openstack: Fix password keyword arg in mwopenstackclients [puppet] - 10https://gerrit.wikimedia.org/r/344304 (owner: 10BryanDavis) [14:31:33] RECOVERY - puppet last run on db1061 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [14:33:10] (03PS1) 10Urbanecm: Add autopatrolled group to svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344393 (https://phabricator.wikimedia.org/T161210) [14:33:16] (03CR) 10Andrew Bogott: [C: 032] "Thanks! I guess I never actually used the command-line arg features :(" [puppet] - 10https://gerrit.wikimedia.org/r/344304 (owner: 10BryanDavis) [14:34:01] (03CR) 10Urbanecm: [C: 04-1] "Who should assign/revoke the group? Clarifying in the task." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/344393 (https://phabricator.wikimedia.org/T161210) (owner: 10Urbanecm) [14:35:22] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 591859 [14:35:49] (03Abandoned) 10Hashar: CI: Install php7.1 [puppet] - 10https://gerrit.wikimedia.org/r/343209 (owner: 10Paladox) [14:36:42] PROBLEM - puppet last run on prometheus1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:39:17] 06Operations, 06Performance-Team, 10Thumbor: Add request URL to thumbor errors - https://phabricator.wikimedia.org/T151553#3125236 (10Gilles) [14:39:28] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Add request URL to thumbor errors - https://phabricator.wikimedia.org/T151553#3125239 (10Gilles) [14:40:58] (03PS4) 10DCausse: [es5 upgrade] step 5: restore normal operations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342034 (https://phabricator.wikimedia.org/T157479) [14:43:52] PROBLEM - parsoid on wtp2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:43] RECOVERY - parsoid on wtp2012 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.128 second response time [14:45:52] PROBLEM - parsoid on wtp2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:46:42] RECOVERY - parsoid on wtp2015 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.092 second response time [14:49:52] PROBLEM - parsoid on wtp2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:20] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Import new kibana and logstash .debs to wikimedia experimental repository - https://phabricator.wikimedia.org/T160597#3125278 (10Gehel) @EBernhardson any reason to use kibana 5.1.2 and not the latest 5.2.2? 
[14:50:42] PROBLEM - DPKG on labvirt1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:50:43] RECOVERY - parsoid on wtp2014 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.126 second response time [14:52:43] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3125282 (10akosiaris) 05Resolved>03Open And reopening :-). Replication failed for rdb2005 and rdb2006 (it's a cascading replication, failure of rdb2005 is expect... [14:54:41] (03PS1) 10Giuseppe Lavagetto: site: move all jobqueue redis masters to the new model [puppet] - 10https://gerrit.wikimedia.org/r/344396 [14:54:43] (03PS1) 10Giuseppe Lavagetto: role::jobqueue_redis: deprecate in the current form [puppet] - 10https://gerrit.wikimedia.org/r/344397 [14:55:55] 06Operations, 10Ops-Access-Requests, 10Gerrit: Add two Analytics team members to wmf-deployments - https://phabricator.wikimedia.org/T161157#3125292 (10Milimetric) Thank you very much [14:56:01] elukey: you were right ofc about also bumping the soft client-output-buffer-limit [14:56:04] (03PS1) 10Alexandros Kosiaris: Increase client_output_buffer_limit soft limit [puppet] - 10https://gerrit.wikimedia.org/r/344399 (https://phabricator.wikimedia.org/T159850) [14:56:22] PROBLEM - parsoid on wtp2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:56:25] turns out rdb2005 during resyncis is IO starved, taking longer than normal to catch up [14:56:42] RECOVERY - DPKG on labvirt1001 is OK: All packages OK [14:56:46] !log dist-upgrading labvirt1001 and rebooting it a few times [14:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:03] not only that, but it's in IOwait quite often [14:57:12] RECOVERY - parsoid on wtp2008 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.127 second response time [14:57:41] (03CR) 10Alexandros Kosiaris: [C: 032] Increase client_output_buffer_limit soft limit [puppet] - 
10https://gerrit.wikimedia.org/r/344399 (https://phabricator.wikimedia.org/T159850) (owner: 10Alexandros Kosiaris) [14:59:00] !log disabled puppet on rdb* fleet [14:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:18] !log enabling and running puppet on rdb200X fleet in a rolling restart scheme [14:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:12] RECOVERY - Check the NTP synchronisation status of timesyncd on db1061 is OK: OK: synced at Thu 2017-03-23 15:00:02 UTC. [15:01:47] o/ akosiaris [15:02:12] gehel: hello [15:02:20] I'm looking at PoolCounter. It doesn't seem to be it's own thing. Is pool counter 3rd party or something we build and maintain? [15:02:42] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Import new kibana and logstash .debs to wikimedia experimental repository - https://phabricator.wikimedia.org/T160597#3125313 (10Gehel) logstash=1:5.2.2-1 uploaded to jessi-wikimedia / experimental, waiting for confirma... [15:02:54] papaul: hello! Ready to break elastic2020 again? [15:03:02] yes [15:03:12] halfak: it is it's own thing. written by Tim years ago [15:03:16] papaul: elastic2020 is still shutdown [15:03:20] gehel: i just power it on [15:03:32] akosiaris, oh.. hmm... where does the server code live? [15:03:59] * gehel is waiting for elastic2020 to come up... [15:04:20] papaul: so the plan is that I run the same test and hope to crash it in the same way as last time? [15:04:28] gehel: yes [15:04:45] halfak: hmm maybe we have never exported it to gerrit. the code is definitely open source though. A quick way to fetch it would be https://apt.wikimedia.org/wikimedia/pool/main/p/poolcounter/poolcounter_1.0.4.tar.gz [15:04:48] gehel: is it up now [15:04:53] yep [15:05:00] loging out from console [15:05:02] halfak: 1.0.4 is the latest release [15:05:10] not in version control? o.O [15:05:15] papaul: ok, I'm ready to crash... 
[15:05:42] RECOVERY - puppet last run on prometheus1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:05:56] halfak: :/ [15:06:01] gehel: last time it took 25 minutes for it to crash [15:06:27] halfak: a no, we have it in gerrit [15:06:33] https://github.com/wikimedia/mediawiki-extensions-PoolCounter/tree/master/daemon [15:06:54] halfak: per https://wikitech.wikimedia.org/wiki/PoolCounter, the daemon code lives next to the extension [15:08:04] papaul: stress + bonnie++ launched [15:08:12] gehel: cool [15:09:03] <_joe_> halfak: we do develop poolcounter [15:09:12] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 608352 [15:09:13] <_joe_> but sorry, in a meeting [15:09:26] <_joe_> no poolcounter is definitely on gerrit [15:09:33] _joe_: read backlog [15:09:34] :P [15:09:37] gotcha. I've got to say I'm still very skeptical of investing in this. [15:09:46] ok, load is going up: https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=elastic2020&var-network=eth0 [15:10:58] halfak: not pushing towards that. But as I 've said we are using it without a hitch for 5+ years, so it's definitely worth an investigation [15:11:21] s/hitch/glitch/ [15:11:47] That is a good point. I might have the opposite of NIHS :) [15:11:48] yes cirrussearch is a heavy user of the poolcounter, never caused us troubles [15:11:51] <_joe_> halfak: what do you need to do? [15:12:04] Limit the number of connections to ORES by a single client. 
[15:12:07] _joe_, ^ [15:12:08] <_joe_> yeah it's a great distributed shared locking system [15:12:27] Might be in a need of a bit of love [15:12:38] without it elasticsearch would be down in no time [15:12:59] halfak: TL;DR It's the result of the "Michael Jackson" effect [15:13:02] _joe_: https://phabricator.wikimedia.org/T160692 for context [15:13:32] gehel: since the last time it took 25 minustes before it crash i am going to move some decom server to storage am i will be back [15:15:04] it's one of those pieces of engineering that are so simple and yet ingenious that one ends up not appreciating it as much as they should cause it never gets in the way [15:15:20] papaul: sure, I'll ping you if I see something... [15:15:37] gehel: ok [15:17:22] PROBLEM - parsoid on wtp2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:18:12] RECOVERY - parsoid on wtp2008 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.865 second response time [15:23:40] 06Operations, 13Patch-For-Review: Evaluate use of systemd-timesyncd on jessie for clock synchronisation - https://phabricator.wikimedia.org/T150257#3125358 (10MoritzMuehlenhoff) 05Open>03Resolved systemd-timesyncd is now enabled on all Debian systems (except those serving as time servers itself) [15:29:12] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 0 [15:29:19] (03PS1) 10Marostegui: db-eqiad.php: Increase db1082 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344402 (https://phabricator.wikimedia.org/T137191) [15:32:15] !log upgrading restbase-test* to Linux 4.9 [15:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1082 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344402 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [15:34:55] 06Operations, 10DNS, 10Traffic: AuthDNS CM/CI refactor - 
https://phabricator.wikimedia.org/T161148#3125387 (10BBlack) [15:35:52] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1082 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344402 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [15:36:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1082 weight - T137191 (duration: 00m 43s) [15:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:05] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [15:37:12] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1082 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344402 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [15:40:01] gehel: it looks like to load is dropping... good [15:40:07] PROBLEM - LVS HTTP IPv4 on parsoid.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:40:17] (03PS1) 10Volans: Failoid: reject all TCP traffic [puppet] - 10https://gerrit.wikimedia.org/r/344406 (https://phabricator.wikimedia.org/T160994) [15:40:39] parsoid paged? [15:40:50] same [15:40:57] RECOVERY - LVS HTTP IPv4 on parsoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 1.607 second response time [15:41:07] that was fast [15:41:07] at least my paging works now :) [15:41:10] mine too [15:41:13] papaul: not by much yet... [15:41:21] (03PS2) 10Volans: Failoid: reject all TCP traffic [puppet] - 10https://gerrit.wikimedia.org/r/344406 (https://phabricator.wikimedia.org/T160994) [15:41:46] page and recovery yep [15:42:02] PROBLEM - parsoid on wtp2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:42:11] And mine. 
[15:42:12] seems like parsoid has been flapping for awhile [15:42:14] * apergos does the backread [15:42:52] RECOVERY - parsoid on wtp2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.161 second response time [15:42:53] PROBLEM - parsoid on wtp2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:43:23] papaul: this is already taking longer than last time, bonnie is in another test phase, probably not generating the same kind of load anymore... [15:43:42] RECOVERY - parsoid on wtp2016 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.145 second response time [15:43:49] gehel:good [15:44:05] the last time it took 25 minutes [15:44:08] papaul: not sure if it is good :/ [15:45:00] papaul: if we can't crash it anymore, do we just say that the problem fixed itself - https://www.youtube.com/watch?v=CGeT5cutXgU [15:45:23] gehel: do you need virtualization (VT) enable on elastic boxes? [15:45:38] nope, no reason to have it enabled [15:46:05] e test [15:46:17] (03CR) 10Volans: [C: 032] Failoid: reject all TCP traffic [puppet] - 10https://gerrit.wikimedia.org/r/344406 (https://phabricator.wikimedia.org/T160994) (owner: 10Volans) [15:46:24] gehel: that is one thing i turn off before you started the test [15:46:31] (03PS2) 10Filippo Giunchedi: Increase nginx timeout on Thumbor machines [puppet] - 10https://gerrit.wikimedia.org/r/344360 (https://phabricator.wikimedia.org/T150746) (owner: 10Gilles) [15:46:56] papaul: ok, so at least we have probably cause... [15:47:19] cmjohnson1: ping on https://phabricator.wikimedia.org/T159632 ;) [15:47:36] hi, it's mutante. my laptop broke ("fan error"). i am now on the way to sf to get a loaner but i have no access to anything. so you wont seeme [15:47:42] until later [15:48:04] ottomata: it's going to have to wait [15:48:05] papaul: bonnie completed its run, I'm restarting it for another pass. If it runs to the end, we can decide to pull that server back in and see how it works... 
[15:48:09] sorry more pressing issues right now [15:48:11] ok [15:48:24] if you get a moment, can you just update ticket with an ETA? [15:49:06] godog: it just needs one more change to add "backup:set" to each of the 3 roles . librenms, torrus, smokeping [15:49:09] gehel: ok [15:49:29] godog: see above, i'll do it once i have access again [15:49:45] mutante_: ok! thanks for the update :) [15:49:59] gehel: may I suggest one test: let it run idle for e.g. one week (not pooling it) and then put aggressive load on it? [15:50:13] dcausse: sure... [15:50:43] (03CR) 10Filippo Giunchedi: [C: 032] Increase nginx timeout on Thumbor machines [puppet] - 10https://gerrit.wikimedia.org/r/344360 (https://phabricator.wikimedia.org/T150746) (owner: 10Gilles) [15:50:55] (03PS1) 10Alexandros Kosiaris: logstash: Filter ores logstash messages and set port [puppet] - 10https://gerrit.wikimedia.org/r/344407 (https://phabricator.wikimedia.org/T149010) [15:52:55] (03CR) 10Jforrester: "make-wmf-branch done in I0afc58e5." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344276 (https://phabricator.wikimedia.org/T141586) (owner: 10Tim Starling) [15:53:38] gehel: also (iirc) the second time it failed 2/3 hours after the switchover [15:53:49] (03PS6) 10Eevans: Enable cqlsh client encryption [puppet] - 10https://gerrit.wikimedia.org/r/342679 (https://phabricator.wikimedia.org/T111113) [15:53:50] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3125426 (10Dispenser) [15:53:58] 06Operations, 13Patch-For-Review, 07discovery-system: Create the failoid service as fallback for the DNS discovery - https://phabricator.wikimedia.org/T160994#3125427 (10Volans) Given that there are a lot of services on non-standard ports and the lvs_services configuration had multiple instances for each dis... 
[15:55:00] 06Operations, 06Performance-Team, 15User-fgiunchedi: Upgrade to Grafana 4.2.0 - https://phabricator.wikimedia.org/T161193#3125428 (10fgiunchedi) a:03fgiunchedi @Peter yeah I can, standard procedure is to upgrade https://grafana-labs.wikimedia.org first and see how that goes, if no issues arise then upgrade... [15:55:22] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 359220 [15:55:34] dcausse: yeah, but we're putting quite a bit more load now than during the switchover (which might or might not be related to the failure) [15:56:16] (03PS2) 10Alexandros Kosiaris: logstash: Filter ores logstash messages and set port [puppet] - 10https://gerrit.wikimedia.org/r/344407 (https://phabricator.wikimedia.org/T149010) [15:58:05] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Import new kibana and logstash .debs to wikimedia experimental repository - https://phabricator.wikimedia.org/T160597#3125436 (10Gehel) [15:58:21] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Import new kibana and logstash .debs to wikimedia experimental repository - https://phabricator.wikimedia.org/T160597#3104732 (10Gehel) kibana uploaded as well [15:59:22] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 1048 [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170323T1600). Please do the needful. 
[16:00:14] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3125440 (10jcrespo) [16:00:40] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Import new kibana and logstash .debs to wikimedia experimental repository - https://phabricator.wikimedia.org/T160597#3125441 (10EBernhardson) for reference kibana is lockstepped with elasticsearch minor... [16:00:40] urandom: I'll take a look at your patch [16:00:55] godog: kk [16:01:49] (03CR) 10Filippo Giunchedi: [C: 032] Enable cqlsh client encryption [puppet] - 10https://gerrit.wikimedia.org/r/342679 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [16:03:40] gehel: did you poweroff elastic2020? [16:04:02] papaul: nope, seems it just crashed again... [16:04:15] gehel: good news [16:04:23] kind of... [16:04:47] urandom: merged, looks good on restbase1007 [16:05:06] godog: cool; thanks! [16:05:31] godog: probably not easy to tell if it's resulting in encrypted connections [16:05:43] papaul: I guess you take over from there? [16:05:52] maybe i'll live-hack one cassandra instance to make encryption non-optional [16:05:59] (03CR) 10Filippo Giunchedi: [C: 04-1] "This would change the settings in production too" [puppet] - 10https://gerrit.wikimedia.org/r/344387 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [16:06:38] gehel: yes looking at the log now [16:06:46] gehel: thanks [16:07:02] papaul: well, thanks to you! Now you have more work... [16:07:19] urandom: yeah I was thinking if there was some cql command to introspect the connection ? 
[16:07:27] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/342604 (https://phabricator.wikimedia.org/T119140) (owner: 10Hashar) [16:07:35] !log restbase deploy 752ca4b7 [16:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:42] subbu, _joe_: ^^ [16:07:54] for blacklisting those two titles [16:08:17] (03PS1) 10Alexandros Kosiaris: Followup increasing client_output_buffer_limit [puppet] - 10https://gerrit.wikimedia.org/r/344411 (https://phabricator.wikimedia.org/T159850) [16:10:10] (03CR) 10Ladsgroup: "One minor thing. Thanks for taking care of this." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344407 (https://phabricator.wikimedia.org/T149010) (owner: 10Alexandros Kosiaris) [16:10:36] !log T111113: Live-hacking client encryption to be non-optional, to verify cqlsh encryption, restbase1007-a.eqiad.wmnet [16:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:41] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [16:10:52] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:11:08] mobrovac, can you share with alro the 2 titles .. he was asking in the parsoid channel. [16:11:27] oh sure [16:11:36] (03CR) 10Alexandros Kosiaris: [C: 032] Followup increasing client_output_buffer_limit [puppet] - 10https://gerrit.wikimedia.org/r/344411 (https://phabricator.wikimedia.org/T159850) (owner: 10Alexandros Kosiaris) [16:15:22] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1001.eqiad.wmnet because of too many down! [16:15:33] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1001.eqiad.wmnet because of too many down! 
[16:15:33] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: connection error: HTTPConnectionPool(host=localhost, port=8000): Read timed out. (read timeout=5) [16:15:33] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: connection error: (urllib3.connectionpool.HTTPConnectionPool object at 0x7f5c0e21ed10, Connection to localhost timed out. (connect timeout=5)) [16:16:12] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1490285771 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3758326 keys, up 1 minutes 6 seconds - replication_delay is 1490285771 [16:16:22] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1490285779 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3756184 keys, up 1 minutes 14 seconds - replication_delay is 1490285779 [16:17:29] <_joe_> Mar 23 16:15:04 ocg1003 mw-ocg-service: {"name":"mw-ocg-service","hostname":"ocg1003","pid":3581,"level":30,"channel":"gc","msg":"Finished GarbageCollection run in 107.44 seconds","time":"2017-03-23T16:15:04.652Z","v":0} [16:17:34] <_joe_> smells like java! 
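A note on the very large replication_delay values in the Redis alerts above (1490278260, 1490285771, 1490285779): read as Unix timestamps, they fall within seconds of the time the alert was posted. One plausible reading, which is an assumption and not confirmed anywhere in the log, is that the check reports "now minus last sync time" and the last sync time is zero right after a restart. The sketch below only checks the arithmetic; the helper names are illustrative and the alert times are taken from the bot timestamps in this log, assumed to be UTC.

```python
from datetime import datetime, timezone


def as_utc(seconds: int) -> datetime:
    """Interpret an integer as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def looks_like_epoch(delay: int, alert: datetime, slack_seconds: int = 60) -> bool:
    """True if `delay`, read as a Unix timestamp, is within `slack_seconds` of the alert time."""
    return abs((alert - as_utc(delay)).total_seconds()) <= slack_seconds


# (reported replication_delay, time the alert appeared in the channel)
SAMPLES = [
    (1490278260, datetime(2017, 3, 23, 14, 11, 9, tzinfo=timezone.utc)),   # rdb2005, port 6479
    (1490285771, datetime(2017, 3, 23, 16, 16, 12, tzinfo=timezone.utc)),  # rdb2005, port 6480
    (660, datetime(2017, 3, 23, 14, 17, 49, tzinfo=timezone.utc)),         # rdb2006, a real delay
]

for delay, alert in SAMPLES:
    kind = "epoch-like" if looks_like_epoch(delay, alert) else "plain seconds"
    print(f"{delay:>12}  {as_utc(delay):%Y-%m-%d %H:%M:%S}  {kind}")
```

If that reading holds, the huge values mean "never synced since restart" and not an actual lag of 47 years, while small values such as 660 are ordinary delays in seconds.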
[16:17:41] or nodejs [16:17:49] <_joe_> yeah it's nodejs [16:17:53] <_joe_> but it smells like java [16:17:55] I can restart while you do your stuff [16:18:08] <_joe_> jynus: I just restarted ocg on 1003 [16:18:12] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3758886 keys, up 3 minutes 6 seconds - replication_delay is 0 [16:18:13] <_joe_> I guess more are needed [16:18:16] ok [16:18:19] (03PS1) 10Marostegui: db-eqiad.php: Increase weight db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344413 (https://phabricator.wikimedia.org/T137191) [16:18:19] doing [16:18:22] _joe_: there is a "j" in nodejs for a reason [16:18:22] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [16:18:22] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3756550 keys, up 3 minutes 13 seconds - replication_delay is 0 [16:18:32] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 758461 msg: ocg_render_job_queue 0 msg [16:18:33] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [16:18:47] ah rdb2005 is fine again :-) [16:18:47] godog: it works, though you have to remember to pass --ssl [16:19:05] why on earth those db change at that rate.. I am not sure [16:19:18] we should get better viewing into redis operations [16:19:45] we only have memory from what I see [16:20:14] (03PS2) 10Giuseppe Lavagetto: site: move all jobqueue redis masters to the new model [puppet] - 10https://gerrit.wikimedia.org/r/344396 [16:20:19] https://grafana-admin.wikimedia.org/dashboard/db/redis-jobqueue-elukey [16:20:24] heh... 
I love luca :-) [16:20:41] gehel: nothing in the log [16:20:52] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3754512 keys, up 5 minutes 47 seconds - replication_delay is 0 [16:21:06] gehel: i am about to start to run a complete HW test on the system before i call HP [16:21:07] papaul: that would have been too easy :) [16:21:10] (03PS1) 10Andrew Bogott: Cold-migrate: Copy directly between labvirts [puppet] - 10https://gerrit.wikimedia.org/r/344414 [16:21:34] gehel: i like it when it is easy lol [16:21:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344413 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [16:22:02] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3753976 keys, up 1 hours 22 minutes - replication_delay is 0 [16:22:07] !log Merged operations/puppet.git Jenkins job in a single one that runs tox then rake - T160923 [16:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:13] T160923: For operations/puppet : merge tox / rake jobs in a single job? - https://phabricator.wikimedia.org/T160923 [16:22:30] papaul: yeah, but I can handle easy problems... I need you for the hard ones... 
[16:22:52] (03CR) 10Giuseppe Lavagetto: [C: 032] site: move all jobqueue redis masters to the new model [puppet] - 10https://gerrit.wikimedia.org/r/344396 (owner: 10Giuseppe Lavagetto) [16:23:04] (03PS2) 10Andrew Bogott: Cold-migrate: Copy directly between labvirts [puppet] - 10https://gerrit.wikimedia.org/r/344414 [16:23:22] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344413 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [16:23:34] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344413 (https://phabricator.wikimedia.org/T137191) (owner: 10Marostegui) [16:24:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1082 weight - T137191 (duration: 00m 43s) [16:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:44] T137191: Defragment db1070, db1082, db1087, db1092 - https://phabricator.wikimedia.org/T137191 [16:25:33] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 759107 msg: ocg_render_job_queue 0 msg [16:25:41] gehel: that is not fair [16:25:43] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3125487 (10akosiaris) 05Open>03Resolved And with the above changes merged and shepherd into the fleet. I am gonna re-declare this resolved (for now). [16:25:57] gehel: ok the HW compete test will take 1 day 3hours [16:26:12] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [16:26:22] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381 [16:26:23] are you ok with the system being down that long? [16:26:31] _joe_: that's ^ you restarting stuff right ? 
[16:26:41] <_joe_> not really [16:26:44] papaul: no problem, we have enough capacity [16:27:04] papaul: and codfw is not active at the moment [16:27:12] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8462412 keys, up 3 minutes 12 seconds - replication_delay is 0 [16:27:22] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8370616 keys, up 3 minutes 18 seconds - replication_delay is 0 [16:27:40] <_joe_> akosiaris: that service is up and working, just had a far sync AFAICT [16:27:59] yeah it was doing a rdb-bgsave [16:28:27] gehel: ok [16:29:27] gehel: here is the first result i am getting from the test after 1% complete [16:29:52] <_joe_> akosiaris: that should not prevent connections though [16:29:58] gehel: Hard Drive Short DST check: WARNING [16:30:16] <_joe_> akosiaris: the new check is much better on the intra-cluster slaves btw [16:30:24] <_joe_> it seriously checks replication [16:30:36] <_joe_> so we might see issues that were masked before [16:30:54] "replication_delay is 97 " [16:31:21] papaul: sounds good!
[16:31:26] <_joe_> jynus: that's my beloved check_redis.pl, with no units [16:31:32] PROBLEM - parsoid on wtp2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:32:22] RECOVERY - parsoid on wtp2019 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.132 second response time [16:36:32] PROBLEM - parsoid on wtp2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:48] subbu, _joe_: deploy done [16:37:22] RECOVERY - parsoid on wtp2018 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.129 second response time [16:37:42] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1490287058 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8454444 keys, up 2 minutes 43 seconds - replication_delay is 1490287058 [16:38:52] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:39:42] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8454309 keys, up 4 minutes 43 seconds - replication_delay is 0 [16:41:54] (03CR) 10Gehel: Update mwgrep for elasticsearch 5.x (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344044 (https://phabricator.wikimedia.org/T161055) (owner: 10EBernhardson) [16:44:54] (03PS9) 10Filippo Giunchedi: [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) [16:45:09] <_joe_> !log reenabling puppet on all jobqueue redises [16:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:08] !log mobrovac@tin Started restart [parsoid/deploy@0c22f72]: (no justification provided) [16:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:16] (03PS2) 10Giuseppe Lavagetto: role::jobqueue_redis: deprecate in the current form [puppet] - 10https://gerrit.wikimedia.org/r/344397 [16:48:17] 06Operations, 
10LDAP-Access-Requests, 06WMDE-Analytics-Engineering, 10Wikidata, 15User-Addshore: Add goransm to ldap/wmde group - https://phabricator.wikimedia.org/T160924#3125553 (10Tobi_WMDE_SW) I can confirm that @GoranSMilovanovic works as a contractor for WMDE as I (in my role as Engineering Manager... [16:49:42] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:49:43] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:50:32] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:50:33] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [16:50:38] 06Operations, 10LDAP-Access-Requests, 06WMDE-Analytics-Engineering, 10Wikidata, 15User-Addshore: Add goransm to ldap/wmde group - https://phabricator.wikimedia.org/T160924#3125561 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff I'll do that once the NDA is on file [16:51:02] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 504 (expecting: 200) [16:51:35] 06Operations, 10LDAP-Access-Requests, 06WMDE-Analytics-Engineering, 10Wikidata, 15User-Addshore: Add goransm to ldap/wmde group - https://phabricator.wikimedia.org/T160924#3125565 (10MoritzMuehlenhoff) (Generally when onboarding someone new, feel free to simply add the LDAP group requests to the ticket r... 
[16:52:02] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [16:53:08] (03CR) 10Volans: [C: 032] Check that core DBs replica is in sync [switchdc] - 10https://gerrit.wikimedia.org/r/343627 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [16:55:26] (03CR) 10EBernhardson: Update mwgrep for elasticsearch 5.x (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344044 (https://phabricator.wikimedia.org/T161055) (owner: 10EBernhardson) [16:55:28] (03PS2) 10EBernhardson: Update mwgrep for elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344044 (https://phabricator.wikimedia.org/T161055) [16:55:32] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:02] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 651 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3757147 keys, up 41 minutes 51 seconds - replication_delay is 651 [16:57:02] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 656 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3756748 keys, up 1 hours 57 minutes - replication_delay is 656 [16:57:08] (03PS3) 10Giuseppe Lavagetto: role::jobqueue_redis: deprecate in the current form [puppet] - 10https://gerrit.wikimedia.org/r/344397 [16:57:16] (03CR) 10Volans: [C: 032] Refactor check for mediawiki config [switchdc] - 10https://gerrit.wikimedia.org/r/344367 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [16:57:50] (03PS21) 10BBlack: [POC] DNS zones to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/342887 [16:58:18] (03CR) 10Volans: [C: 032] Allow to execute integration tests from tox [switchdc] - 10https://gerrit.wikimedia.org/r/344368 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: 
Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170323T1700). Please do the needful. [17:00:13] no parsoid deploy today [17:00:17] (03CR) 10Volans: [C: 032] Make all custom exceptions inherit from a common one [switchdc] - 10https://gerrit.wikimedia.org/r/344369 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [17:00:22] (03CR) 10Giuseppe Lavagetto: [C: 032] role::jobqueue_redis: deprecate in the current form [puppet] - 10https://gerrit.wikimedia.org/r/344397 (owner: 10Giuseppe Lavagetto) [17:01:06] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:22] (03CR) 10Volans: [C: 032] Moved all libraries in a single place [switchdc] - 10https://gerrit.wikimedia.org/r/344370 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [17:03:10] (03CR) 10Volans: [C: 032] Fix import error [switchdc] - 10https://gerrit.wikimedia.org/r/344371 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [17:03:15] (03Abandoned) 10BBlack: authdns: do not restart gdnsd on file changes [puppet] - 10https://gerrit.wikimedia.org/r/344144 (owner: 10Giuseppe Lavagetto) [17:04:32] (03CR) 10Volans: [C: 032] Uniform some docstrings and log formatting [switchdc] - 10https://gerrit.wikimedia.org/r/344372 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [17:06:22] (03PS1) 10Volans: Move all tasks to use the Remote class [switchdc] - 10https://gerrit.wikimedia.org/r/344423 (https://phabricator.wikimedia.org/T160178) [17:06:24] (03PS1) 10Volans: Move puppet-specific commands to it's module [switchdc] - 10https://gerrit.wikimedia.org/r/344424 (https://phabricator.wikimedia.org/T160178) [17:06:26] (03PS1) 10Volans: Make task execution more pythonic [switchdc] - 10https://gerrit.wikimedia.org/r/344425 (https://phabricator.wikimedia.org/T160178) [17:06:28] (03PS1) 10Volans: Use 
the Remote class everywhere [switchdc] - 10https://gerrit.wikimedia.org/r/344426 (https://phabricator.wikimedia.org/T160178) [17:06:30] (03PS1) 10Volans: Remote module: move everything inside Remote class [switchdc] - 10https://gerrit.wikimedia.org/r/344427 (https://phabricator.wikimedia.org/T160178) [17:06:32] (03PS1) 10Volans: Reorganize and update tests to the new structure [switchdc] - 10https://gerrit.wikimedia.org/r/344428 (https://phabricator.wikimedia.org/T160178) [17:08:55] (03CR) 10Filippo Giunchedi: [C: 031] "Change LGTM, though see my comment re: serveralias" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344221 (owner: 10Chad) [17:11:45] (03CR) 10Gehel: postgresql - drop support for postgis 1.5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344176 (owner: 10Gehel) [17:12:20] (03PS2) 10Gehel: elasticsearch - enable shipping logs to logstash for elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/342618 [17:13:23] (03CR) 10BBlack: "This will deploy (all entries here match discovery::services in https://gerrit.wikimedia.org/r/#/c/344088/1/hieradata/common/discovery.yam" [dns] - 10https://gerrit.wikimedia.org/r/344093 (owner: 10Giuseppe Lavagetto) [17:14:53] (03PS1) 10Eevans: Cope with client encryption when so-enabled [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/344431 [17:18:13] (03CR) 10Filippo Giunchedi: "I tested this locally on thumbor1001 and I can see a mixture of https and http traffic, expected?" [puppet] - 10https://gerrit.wikimedia.org/r/343263 (https://phabricator.wikimedia.org/T160670) (owner: 10Gilles) [17:18:15] (03CR) 10DCausse: [C: 031] elasticsearch - enable shipping logs to logstash for elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/342618 (owner: 10Gehel) [17:18:42] (03CR) 10EBernhardson: [C: 031] "seems sane to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/342618 (owner: 10Gehel) [17:19:12] (03PS3) 10Jcrespo: Revert "netboot.cfg: Format db1082" [puppet] - 10https://gerrit.wikimedia.org/r/344379 [17:24:08] (03CR) 10Jcrespo: [C: 032] Revert "netboot.cfg: Format db1082" [puppet] - 10https://gerrit.wikimedia.org/r/344379 (owner: 10Jcrespo) [17:24:35] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:25:53] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles) [17:28:05] PROBLEM - parsoid on wtp2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:05] (03PS3) 10Gehel: elasticsearch - enable shipping logs to logstash for elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/342618 [17:28:55] RECOVERY - parsoid on wtp2007 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.163 second response time [17:29:05] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:30:10] (03CR) 10Gehel: [C: 032] elasticsearch - enable shipping logs to logstash for elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/342618 (owner: 10Gehel) [17:31:59] !log restarting elasticsearch on elastic2002 to check new logging configuration [17:32:59] (03PS1) 10Ottomata: Use 3 parallel sqoop processors instead of 5 [puppet] - 10https://gerrit.wikimedia.org/r/344432 [17:33:05] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3753235 keys, up 1 hours 17 minutes - replication_delay is 35 [17:33:21] (03CR) 10Ottomata: [V: 032 C: 032] Use 3 parallel sqoop processors instead of 5 [puppet] - 10https://gerrit.wikimedia.org/r/344432 (owner: 10Ottomata) [17:33:55] (03CR) 10Filippo Giunchedi: [C: 032] Cope with client encryption when so-enabled 
[debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/344431 (owner: 10Eevans) [17:34:53] (03PS3) 10Filippo Giunchedi: Migrate typos check to a rake task [puppet] - 10https://gerrit.wikimedia.org/r/342604 (https://phabricator.wikimedia.org/T119140) (owner: 10Hashar) [17:37:30] (03PS22) 10BBlack: [POC] DNS zones to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/342887 [17:38:55] (03CR) 10Filippo Giunchedi: [C: 032] Migrate typos check to a rake task [puppet] - 10https://gerrit.wikimedia.org/r/342604 (https://phabricator.wikimedia.org/T119140) (owner: 10Hashar) [17:46:23] (03PS1) 10Giuseppe Lavagetto: profile::jobqueue_redis::master: refactoring [puppet] - 10https://gerrit.wikimedia.org/r/344435 [17:52:52] (03PS3) 10Catrope: Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343437 [17:52:57] !log update RESTBase to 9d2b393fb - staging [17:53:00] (03CR) 10jerkins-bot: [V: 04-1] Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343437 (owner: 10Catrope) [17:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:05] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 602 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3753430 keys, up 1 hours 37 minutes - replication_delay is 602 [17:53:34] 06Operations: Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters - https://phabricator.wikimedia.org/T161145#3125761 (10BBlack) I was trying to think of a way to do this that isn't quite as stateful as current `cron_splay`, but I haven't thought of a good one yet. If we... 
[17:53:41] (03PS1) 10Jcrespo: mariadb-core: Decouple Mariadb semi-sync replication from $::mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344442 (https://phabricator.wikimedia.org/T161007) [17:53:55] (03PS2) 10Jcrespo: mariadb-core: Decouple Mariadb semi-sync replication from $::mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/344442 (https://phabricator.wikimedia.org/T161007) [17:56:28] (03CR) 10Hashar: "https://gerrit.wikimedia.org/r/#/c/344444/ will drop the job from CI entirely. Thanks for the git grep suggestion :-}" [puppet] - 10https://gerrit.wikimedia.org/r/342604 (https://phabricator.wikimedia.org/T119140) (owner: 10Hashar) [17:56:30] (03PS4) 10Catrope: Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343437 [17:57:37] (03PS11) 10Mbch331: Remove exception on Other Projects sidebar for Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341195 (https://phabricator.wikimedia.org/T159634) [17:58:00] (03CR) 10Hashar: [C: 04-1] "Yup is is really just a dirty change that I have cherry picked on the deployment-prep puppet master. Will refactor it toward using class " [puppet] - 10https://gerrit.wikimedia.org/r/344387 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170323T1800). Please do the needful. [18:00:04] Mbch331: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[18:00:53] (03PS2) 10Giuseppe Lavagetto: profile::jobqueue_redis::master: refactoring [puppet] - 10https://gerrit.wikimedia.org/r/344435 [18:01:05] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3753847 keys, up 1 hours 45 minutes - replication_delay is 0 [18:01:57] (03CR) 10Muehlenhoff: "The debdeploy change is fine (haven't looked at the other bits)" [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [18:02:05] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3754066 keys, up 3 hours 3 minutes - replication_delay is 0 [18:04:16] !log update RESTBase to 9d2b393fb - production [18:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:45] PROBLEM - parsoid on wtp2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:04:58] HI mbch331 [18:05:07] Hi Dereckson [18:05:31] Let's look at your change. [18:05:35] RECOVERY - parsoid on wtp2017 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.139 second response time [18:05:56] okay james and Urbanecm already checked that, perfect, let's deploy so [18:06:25] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:06:35] PROBLEM - parsoid on wtp2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:06:40] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341195 (https://phabricator.wikimedia.org/T159634) (owner: 10Mbch331) [18:07:18] mbch331: I'll push it first to mwdebug1002, so you can test it before it reaches prod [18:07:26] RECOVERY - parsoid on wtp2010 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.088 second response time [18:07:28] OK [18:07:32] mbch331: you've an extension for chrome and firefox to be able to test that: https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [18:08:37] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::jobqueue_redis::master: refactoring [puppet] - 10https://gerrit.wikimedia.org/r/344435 (owner: 10Giuseppe Lavagetto) [18:10:07] We wait zuul to assign us a operations-mw-config-composer-hhvm-jessie slot. [18:10:07] Extension is installed [18:10:41] (03Merged) 10jenkins-bot: Remove exception on Other Projects sidebar for Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341195 (https://phabricator.wikimedia.org/T159634) (owner: 10Mbch331) [18:10:50] (03CR) 10jenkins-bot: Remove exception on Other Projects sidebar for Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341195 (https://phabricator.wikimedia.org/T159634) (owner: 10Mbch331) [18:10:56] Here we are [18:11:38] mbch331: okay, live on mwdebug1002, you can visit nl.wikipedia with the extension: put the button to on, and pick mwdebug1002 in the list, you should then be able to have the version without the sidebar exception [18:14:09] Checked and it works.
Page looks normal and I found a page that shows the other sections sidebar with a link to commons [18:15:11] mbch331: okay [18:15:35] 30 error: unknown exception [18:15:49] sometimes, fatalmonitor messages are useful [18:15:56] unrelated with your change by the way [18:16:05] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Remove exception on Other Projects sidebar for Dutch Wikipedia (T159634) (duration: 00m 47s) [18:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:11] T159634: Enable Other projects sidebar on nlwiki - https://phabricator.wikimedia.org/T159634 [18:16:32] mbch331: here you are :-) [18:16:37] Thanks for the patch. [18:16:42] Thank you. I see it working on nlwiki. [18:17:08] You're welcome. [18:20:15] PROBLEM - parsoid on wtp2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:20:36] (03PS5) 10Gehel: Direct LDF requests to single host to solve paging issues [puppet] - 10https://gerrit.wikimedia.org/r/344197 (https://phabricator.wikimedia.org/T159574) (owner: 10Smalyshev) [18:22:05] RECOVERY - parsoid on wtp2011 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.135 second response time [18:24:08] (03CR) 10Gehel: [C: 032] Direct LDF requests to single host to solve paging issues [puppet] - 10https://gerrit.wikimedia.org/r/344197 (https://phabricator.wikimedia.org/T159574) (owner: 10Smalyshev) [18:32:10] 06Operations: Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters - https://phabricator.wikimedia.org/T161145#3125924 (10Volans) I agree with the principle, but we should also take into account the total distribution against the puppetmasters to avoid congestions and be ca... 
[18:35:25] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [18:42:13] {'cp1051.eqiad.wmnet': 'LLDPD NOT RUNNING'} [18:42:13] {'cp2009.codfw.wmnet': 'LLDPD NOT RUNNING'} [18:42:13] {'cp1061.eqiad.wmnet': 'LLDPD NOT RUNNING'} [18:42:18] oops! [18:51:43] !log systemctl enable+start of lldpd on cp2009, cp1051, cp1061 (mysteriously dead and disabled) [18:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170323T1900). Please do the needful. [19:00:15] * thcipriani does [19:00:30] * Reedy watches a train pull into the station outside his window [19:00:32] True story [19:02:24] sprayment MediaWiki on the side of the one of the cars and snap a video :) [19:02:45] err, s/sprayment/spraypaint/, my brain seems to be malfunctioning [19:02:47] 06Operations, 10DBA, 05DC-Switchover-Prep-Q3-2016-17, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3126018 (10Volans) @jcrespo if I understand the patch correctly this means that we'll ac... [19:07:34] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 3 others: LDF endpoint ordering is not stable between servers when paging - https://phabricator.wikimedia.org/T159574#3126038 (10Gehel) Varnish patch deployed. I'll keep an eye on logs to make sure all request are routed as we expect. We still need to f... 
[19:08:14] (03PS1) 10Thcipriani: group1 wikis to 1.29.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344473 [19:08:16] (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.29.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344473 (owner: 10Thcipriani) [19:08:44] (03PS1) 10Ottomata: Updating otto's iterm2 shell integration script [puppet] - 10https://gerrit.wikimedia.org/r/344474 [19:09:04] (03CR) 10Ottomata: [V: 032 C: 032] Updating otto's iterm2 shell integration script [puppet] - 10https://gerrit.wikimedia.org/r/344474 (owner: 10Ottomata) [19:10:14] (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344473 (owner: 10Thcipriani) [19:10:22] (03CR) 10jenkins-bot: group1 wikis to 1.29.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344473 (owner: 10Thcipriani) [19:11:18] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.17 [19:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:35] 06Operations, 10Cassandra, 06Services (doing): Upload cassandra-tools-wmf 1.0.1-1 Debian package to apt.w.o - https://phabricator.wikimedia.org/T161239#3126056 (10Eevans) [19:11:50] 06Operations, 10Cassandra, 06Services (doing): Upload cassandra-tools-wmf 1.0.1-1 Debian package to apt.w.o - https://phabricator.wikimedia.org/T161239#3126070 (10Eevans) p:05Triage>03Normal [19:12:35] PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:14:49] will monitor that deploy briefly and then roll out 1.29.0-wmf.17 to all groups if all looks good. [19:25:45] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:29:52] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 3 others: LDF endpoint ordering is not stable between servers when paging - https://phabricator.wikimedia.org/T159574#3071930 (10Gehel) This is done. Longer term solution is tracked on T161240. [19:31:49] PROBLEM - zotero on sca1004 is CRITICAL: HTTP CRITICAL - No data received from host [19:32:49] RECOVERY - zotero on sca1004 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.010 second response time [19:34:55] !log thcipriani@tin Synchronized php: Swap symlink for 1.29.0-wmf.17 (duration: 00m 43s) [19:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:49] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:40:29] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:42:56] (03PS1) 10Giuseppe Lavagetto: profile::redis::multidc: generic profile for redis cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/344479 [19:42:58] (03PS1) 10Giuseppe Lavagetto: role::memcached: convert to use profile::multidc::redis [puppet] - 10https://gerrit.wikimedia.org/r/344480 [19:45:45] (03CR) 10jerkins-bot: [V: 04-1] role::memcached: convert to use profile::multidc::redis [puppet] - 10https://gerrit.wikimedia.org/r/344480 (owner: 10Giuseppe Lavagetto) [19:49:02] (03PS2) 10Volans: Move all tasks to use the Remote class [switchdc] - 10https://gerrit.wikimedia.org/r/344423 (https://phabricator.wikimedia.org/T160178) [19:50:53] (03PS1) 10Ottomata: Move data_drop and hive::site_hdfs roles from analytics1027 to analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/344482 (https://phabricator.wikimedia.org/T159527) [19:51:09] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:51:22] (03PS2) 10Giuseppe Lavagetto: profile::redis::multidc: generic profile for redis cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/344479 [19:51:25] (03PS2) 10Ottomata: Move data_drop and hive::site_hdfs roles from analytics1027 to analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/344482 (https://phabricator.wikimedia.org/T159527) [19:53:53] (03CR) 10jerkins-bot: [V: 04-1] Move data_drop and hive::site_hdfs roles from analytics1027 to analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/344482 (https://phabricator.wikimedia.org/T159527) (owner: 10Ottomata) [19:53:59] RECOVERY - puppet last run on ms-be1025 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:55:02] (03PS3) 10Ottomata: Move data_drop and hive::site_hdfs roles from analytics1027 to analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/344482 (https://phabricator.wikimedia.org/T159527) [19:56:30] (03PS3) 10Giuseppe Lavagetto: profile::redis::multidc: generic profile for redis cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/344479 [19:57:06] (03PS3) 10Rush: Cold-migrate: Copy directly between labvirts [puppet] - 10https://gerrit.wikimedia.org/r/344414 (owner: 10Andrew Bogott) [19:57:34] (03CR) 10Rush: [C: 031] "this seems right" [puppet] - 10https://gerrit.wikimedia.org/r/344414 (owner: 10Andrew Bogott) [20:00:25] (03PS4) 10Giuseppe Lavagetto: profile::redis::multidc: generic profile for redis cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/344479 [20:00:49] (03CR) 10Ottomata: [C: 032] Move data_drop and hive::site_hdfs roles from analytics1027 to analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/344482 (https://phabricator.wikimedia.org/T159527) (owner: 10Ottomata) [20:00:51] (03CR) 10Andrew Bogott: [C: 032] Cold-migrate: Copy directly between labvirts [puppet] - 10https://gerrit.wikimedia.org/r/344414 (owner: 10Andrew Bogott) [20:01:00] (03PS4) 
10Andrew Bogott: Cold-migrate: Copy directly between labvirts [puppet] - 10https://gerrit.wikimedia.org/r/344414 [20:01:22] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 3 others: LDF endpoint ordering is not stable between servers when paging - https://phabricator.wikimedia.org/T159574#3126145 (10Smalyshev) 05Open>03Resolved [20:03:55] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/5888/rdb2005.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/344479 (owner: 10Giuseppe Lavagetto) [20:04:03] (03PS5) 10Giuseppe Lavagetto: profile::redis::multidc: generic profile for redis cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/344479 [20:04:16] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::redis::multidc: generic profile for redis cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/344479 (owner: 10Giuseppe Lavagetto) [20:05:00] (03PS1) 10Ottomata: Move hdfs balancer from an27 to an03 [puppet] - 10https://gerrit.wikimedia.org/r/344487 (https://phabricator.wikimedia.org/T159527) [20:05:15] (03PS2) 10Ottomata: Move hdfs balancer from an27 to an03 [puppet] - 10https://gerrit.wikimedia.org/r/344487 (https://phabricator.wikimedia.org/T159527) [20:05:45] (03CR) 10Ottomata: [V: 032 C: 032] Move hdfs balancer from an27 to an03 [puppet] - 10https://gerrit.wikimedia.org/r/344487 (https://phabricator.wikimedia.org/T159527) (owner: 10Ottomata) [20:06:49] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [20:12:46] (03PS1) 10Ottomata: Move camus jobs from analytics1027 to analytlics1003 [puppet] - 10https://gerrit.wikimedia.org/r/344489 (https://phabricator.wikimedia.org/T159527) [20:13:53] (03PS1) 10Thcipriani: all wikis to 1.29.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344490 [20:13:55] (03CR) 10Thcipriani: [C: 032] all wikis to 1.29.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344490 (owner: 10Thcipriani) 
[20:15:05] (03CR) 10Ottomata: [C: 032] Move camus jobs from analytics1027 to analytlics1003 [puppet] - 10https://gerrit.wikimedia.org/r/344489 (https://phabricator.wikimedia.org/T159527) (owner: 10Ottomata) [20:15:11] (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344490 (owner: 10Thcipriani) [20:16:12] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.17 [20:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:53] (03CR) 10jenkins-bot: all wikis to 1.29.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344490 (owner: 10Thcipriani) [20:20:09] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:24:49] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [20:34:07] (03PS1) 10Ottomata: Install and run hue on throrium [puppet] - 10https://gerrit.wikimedia.org/r/344496 (https://phabricator.wikimedia.org/T159527) [20:36:20] (03CR) 10Ottomata: [V: 032 C: 032] Install and run hue on throrium [puppet] - 10https://gerrit.wikimedia.org/r/344496 (https://phabricator.wikimedia.org/T159527) (owner: 10Ottomata) [20:36:59] PROBLEM - puppet last run on ms-fe1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:37:24] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3126365 (10Nirzar) @phuedx is our total delay more than 1000s? I thought it was around 750? @Peter >0.1 second: Limit for us... 
[20:38:55] (CR) Volans: [C: 2] Move all tasks to use the Remote class [switchdc] - https://gerrit.wikimedia.org/r/344423 (https://phabricator.wikimedia.org/T160178) (owner: Volans) [20:39:11] (CR) Krinkle: [C: 1] Update mwgrep for elasticsearch 5.x [puppet] - https://gerrit.wikimedia.org/r/344044 (https://phabricator.wikimedia.org/T161055) (owner: EBernhardson) [20:39:42] (PS1) Ottomata: Make sure analytics_cluster::apt is required before other packages are installed [puppet] - https://gerrit.wikimedia.org/r/344498 [20:39:54] (CR) Ottomata: [V: 2 C: 2] Make sure analytics_cluster::apt is required before other packages are installed [puppet] - https://gerrit.wikimedia.org/r/344498 (owner: Ottomata) [20:41:09] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[oozie-client],Package[hadoop-client],Package[mahout] [20:41:13] (CR) Volans: [C: 2] Move puppet-specific commands to it's module [switchdc] - https://gerrit.wikimedia.org/r/344424 (https://phabricator.wikimedia.org/T160178) (owner: Volans) [20:41:19] (PS2) Volans: Move puppet-specific commands to it's module [switchdc] - https://gerrit.wikimedia.org/r/344424 (https://phabricator.wikimedia.org/T160178) [20:44:24] (PS2) Volans: Make task execution more pythonic [switchdc] - https://gerrit.wikimedia.org/r/344425 (https://phabricator.wikimedia.org/T160178) [20:46:21] (CR) Volans: [C: 2] Make task execution more pythonic [switchdc] - https://gerrit.wikimedia.org/r/344425 (https://phabricator.wikimedia.org/T160178) (owner: Volans) [20:46:56] Operations, MediaWiki-Configuration, Performance-Team, Patch-For-Review, and 6 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3126417 (aaron) >>! 
In T156924#3101507, @Joe wrote: > - Values would be like (yes, it's json) > `... [20:49:55] (PS2) Volans: Use the Remote class everywhere [switchdc] - https://gerrit.wikimedia.org/r/344426 (https://phabricator.wikimedia.org/T160178) [20:51:10] (CR) Volans: [C: 2] Use the Remote class everywhere [switchdc] - https://gerrit.wikimedia.org/r/344426 (https://phabricator.wikimedia.org/T160178) (owner: Volans) [20:55:09] PROBLEM - DPKG on thorium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:56:09] RECOVERY - DPKG on thorium is OK: All packages OK [20:59:01] PROBLEM - DPKG on thorium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:59:34] (PS2) Volans: Remote module: move everything inside Remote class [switchdc] - https://gerrit.wikimedia.org/r/344427 (https://phabricator.wikimedia.org/T160178) [21:00:01] RECOVERY - DPKG on thorium is OK: All packages OK [21:00:11] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [21:01:01] (CR) Volans: [C: 2] Remote module: move everything inside Remote class [switchdc] - https://gerrit.wikimedia.org/r/344427 (https://phabricator.wikimedia.org/T160178) (owner: Volans) [21:02:51] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:05:01] RECOVERY - puppet last run on ms-fe1005 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [21:05:13] (PS2) Volans: Reorganize and update tests to the new structure [switchdc] - https://gerrit.wikimedia.org/r/344428 (https://phabricator.wikimedia.org/T160178) [21:06:30] Operations, Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3126542 (RobH) [21:07:10] Operations, Ops-Access-Requests: Requesting access to reseachers, analytics-wmde, analytics-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T160980#3117127 (RobH) I've updated the task description with the checklist steps. We need a couple of things from @GoranSMilovanovic still: [] - s... [21:09:09] Operations, Ops-Access-Requests: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181#3123873 (RobH) Any request for deployers is a sudo request, and thus have to be reviewed in our weekly operations team meeting. Since I'm on clinic duty, I'll make sure this is... [21:10:11] Operations, Ops-Access-Requests, Gerrit: archiva-deploy password for Chad H. - https://phabricator.wikimedia.org/T161067#3120509 (RobH) I think its fine to share, but if its not an emergency is it ok to just get this approved in our ops meeting next Monday? When its approved, then we can just gpg en... 
[21:14:25] (CR) Volans: [C: 2] Reorganize and update tests to the new structure [switchdc] - https://gerrit.wikimedia.org/r/344428 (https://phabricator.wikimedia.org/T160178) (owner: Volans) [21:14:35] (CR) Alexandros Kosiaris: logstash: Filter ores logstash messages and set port (1 comment) [puppet] - https://gerrit.wikimedia.org/r/344407 (https://phabricator.wikimedia.org/T149010) (owner: Alexandros Kosiaris) [21:21:01] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:29:57] (PS1) Krinkle: StartProfiler: Disable production sampling for xhgui [mediawiki-config] - https://gerrit.wikimedia.org/r/344504 (https://phabricator.wikimedia.org/T110125) [21:30:26] thcipriani: train is done? [21:30:48] Need to roll out ^ as tungsten is running out of space and we don't use these profiles, so we can just stop them for now. [21:30:51] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [21:30:55] greg-g: ^ [21:31:31] * greg-g looks [21:31:43] Krinkle: looks like it [21:31:50] (PS2) Krinkle: StartProfiler: Disable production sampling for xhgui [mediawiki-config] - https://gerrit.wikimedia.org/r/344504 (https://phabricator.wikimedia.org/T161196) [21:32:15] (CR) Krinkle: [C: 2] StartProfiler: Disable production sampling for xhgui [mediawiki-config] - https://gerrit.wikimedia.org/r/344504 (https://phabricator.wikimedia.org/T161196) (owner: Krinkle) [21:33:59] (Merged) jenkins-bot: StartProfiler: Disable production sampling for xhgui [mediawiki-config] - https://gerrit.wikimedia.org/r/344504 (https://phabricator.wikimedia.org/T161196) (owner: Krinkle) [21:34:11] (CR) jenkins-bot: StartProfiler: Disable production sampling for xhgui [mediawiki-config] - https://gerrit.wikimedia.org/r/344504 (https://phabricator.wikimedia.org/T161196) (owner: Krinkle) [21:36:09] !log 
krinkle@tin Synchronized wmf-config/StartProfiler.php: (no justification provided) (duration: 00m 53s) [21:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:52] Krinkle: sorry walked away in search of coffee. Train is indeed done (as you saw) [21:37:06] Thx [21:42:44] (PS1) EBernhardson: Allow search clusters to reindex from eachother [puppet] - https://gerrit.wikimedia.org/r/344517 [21:44:17] (CR) EBernhardson: "came across the issue today that we couldn't directly copy an index from codfw into relforge, because it wasn't whitelisted in the config." [puppet] - https://gerrit.wikimedia.org/r/344517 (owner: EBernhardson) [21:45:29] (CR) Smalyshev: [C: 1] Allow search clusters to reindex from eachother [puppet] - https://gerrit.wikimedia.org/r/344517 (owner: EBernhardson) [21:46:14] (PS2) Madhuvishy: new cert for *.tools.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/342254 (owner: RobH) [21:46:53] (CR) Ladsgroup: [C: 1] "Okay, great" [puppet] - https://gerrit.wikimedia.org/r/344407 (https://phabricator.wikimedia.org/T149010) (owner: Alexandros Kosiaris) [21:48:05] (CR) Madhuvishy: [C: 2] new cert for *.tools.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/342254 (owner: RobH) [21:49:01] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [21:49:06] (CR) Andrew Bogott: [C: 2] labtest: labtestmetal convert labtestvirt2002 [dns] - https://gerrit.wikimedia.org/r/344194 (owner: Rush) [21:49:08] (PS2) EBernhardson: Allow search clusters to reindex from eachother [puppet] - https://gerrit.wikimedia.org/r/344517 [21:49:59] Operations, Ops-Access-Requests, Performance-Team: Grant perf-roots access to tungsten - https://phabricator.wikimedia.org/T161261#3126835 (Krinkle) [21:50:51] (PS1) Krinkle: Grant admin rights on tungsten to perf-roots [puppet] - 
https://gerrit.wikimedia.org/r/344531 (https://phabricator.wikimedia.org/T161261) [21:51:06] marostegui: Hey, if you're around: regarding https://phabricator.wikimedia.org/T151681 do we have a metric in grafana/graphite for average lock time of dispatcher? We are introducing a new method of locking mechanism and I want to know if it's going to have the impact we want [21:51:57] (CR) Andrew Bogott: [C: 1] labtest: convert labtestmetal to labtestvirt2001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/344193 (owner: Rush) [21:52:17] (PS2) Andrew Bogott: labtest: convert labtestmetal to labtestvirt2001 [puppet] - https://gerrit.wikimedia.org/r/344193 (owner: Rush) [21:55:06] (PS3) Andrew Bogott: labtest: convert labtestmetal to labtestvirt2001 [puppet] - https://gerrit.wikimedia.org/r/344193 (owner: Rush) [21:56:12] (PS4) Andrew Bogott: labtest: convert labtestmetal to labtestvirt2001 [puppet] - https://gerrit.wikimedia.org/r/344193 (owner: Rush) [21:58:13] (CR) Andrew Bogott: [C: 2] labtest: convert labtestmetal to labtestvirt2001 [puppet] - https://gerrit.wikimedia.org/r/344193 (owner: Rush) [22:00:45] mutante: It used to be that I refreshed puppet on Carbon before pxe-booting a new machine… what box is in charge of pxe images and partman and such now? [22:01:34] andrewbogott i think mutante is away today. 
[22:01:42] seems so [22:01:49] yep [22:04:11] PROBLEM - Host labtestmetal2001 is DOWN: PING CRITICAL - Packet loss = 100% [22:06:22] andrewbogott: depends on the DC, install[12]002 and the bastions on the other DCs IIRC [22:07:01] RECOVERY - Host labtestmetal2001 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [22:07:09] volans: ok, thanks [22:07:32] see also the email from daniel [22:08:11] PROBLEM - parsoid on wtp2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:08:55] (PS1) Subramanya Sastry: Visualdiff: Install indic and other font packages [puppet] - https://gerrit.wikimedia.org/r/344551 [22:09:01] PROBLEM - dhclient process on labtestmetal2001 is CRITICAL: Return code of 255 is out of bounds [22:09:01] RECOVERY - parsoid on wtp2009 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.137 second response time [22:09:21] PROBLEM - DPKG on labtestmetal2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:09:56] (CR) Subramanya Sastry: "Okay, I think I've updated the few wiki references to the old domain. 
There are a whole lot of references to the old domain in gerrit patc" [puppet] - https://gerrit.wikimedia.org/r/343948 (owner: Dzahn) [22:10:17] (CR) Subramanya Sastry: [C: 1] misc-varnish/parsoid-tests: remove parsoid-tests backend [puppet] - https://gerrit.wikimedia.org/r/343948 (owner: Dzahn) [22:12:18] (CR) Volans: "Puppet compiler results available here: https://puppet-compiler.wmflabs.org/5890/tungsten.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/344531 (https://phabricator.wikimedia.org/T161261) (owner: Krinkle) [22:14:42] (CR) Muehlenhoff: Visualdiff: Install indic and other font packages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/344551 (owner: Subramanya Sastry) [22:14:50] Operations, hardware-requests: Rename labtestmetal2001 - https://phabricator.wikimedia.org/T161265#3126920 (Andrew) [22:20:12] (CR) Subramanya Sastry: Visualdiff: Install indic and other font packages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/344551 (owner: Subramanya Sastry) [22:31:51] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [22:36:26] Operations, Labs, Tool-Labs, Patch-For-Review: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3127011 (madhuvishy) Open>Resolved Is done now! [22:39:48] !log update RESTBase to 2536b25c7 - codfw [22:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:01] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:45:34] (PS2) Subramanya Sastry: Visualdiff: Install mediawiki::packages::fonts [puppet] - https://gerrit.wikimedia.org/r/344551 [22:46:56] (CR) jerkins-bot: [V: -1] Visualdiff: Install mediawiki::packages::fonts [puppet] - https://gerrit.wikimedia.org/r/344551 (owner: Subramanya Sastry) [22:50:31] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:53:51] PROBLEM - puppet last run on aluminium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:56:33] !log update RESTBase to 2536b25c7 - staging [22:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:35] PROBLEM - configured eth on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [22:59:55] PROBLEM - dhclient process on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170323T2300). Please do the needful. [23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:05] PROBLEM - kvm ssl cert on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [23:00:25] PROBLEM - nova-compute process on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [23:00:28] jouncebot: thank you for the sleep reminder!! 
[23:00:35] PROBLEM - puppet last run on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [23:00:45] PROBLEM - salt-minion processes on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [23:01:55] PROBLEM - DPKG on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [23:02:15] PROBLEM - Disk space on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [23:02:25] PROBLEM - MD RAID on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [23:02:38] I can SWAT [23:03:04] RoanKattouw: ping me when you're around and I can get your change out [23:03:58] thcipriani: Sorry, got distracted. Go ahead with the first step, I'll be back at my desk in a couple minutes [23:04:05] okie doke [23:04:34] Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3127084 (Dzahn) Open>Resolved Great, thanks for confirming it works. Resolving this ticket now. 
[23:05:57] !log update RESTBase to 2536b25c7 - eqiad [23:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:42] thcipriani: OK, at my desk now [23:06:52] Lemme know when it's on 1002 [23:06:55] or wherever [23:07:02] will do [23:08:05] RECOVERY - puppet last run on mw1237 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [23:08:17] RoanKattouw: pulled on mwdebug1002 now [23:10:00] thcipriani: LGTM [23:10:11] ok, going live [23:10:35] PROBLEM - MPT RAID on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [23:12:46] !log thcipriani@tin Synchronized php-1.29.0-wmf.17/extensions/ORES: SWAT: [[gerrit:344555|Stats: Invert "false" thresholds so they are correct]] T161250 (duration: 00m 52s) [23:12:53] ^ RoanKattouw live everywhere [23:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:54] T161250: Good faith test does not work properly - https://phabricator.wikimedia.org/T161250 [23:12:56] Yay thanks [23:18:35] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [23:21:45] RECOVERY - puppet last run on aluminium is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [23:22:25] RECOVERY - MD RAID on labtestvirt2002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [23:22:35] RECOVERY - configured eth on labtestvirt2002 is OK: OK - interfaces up [23:22:45] RECOVERY - salt-minion processes on labtestvirt2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:22:55] RECOVERY - dhclient process on labtestvirt2002 is OK: PROCS OK: 0 processes with command name dhclient [23:22:55] RECOVERY - DPKG on labtestvirt2002 is OK: All packages OK [23:23:05] RECOVERY - dhclient process on labtestmetal2001 is OK: PROCS OK: 0 processes with command name dhclient [23:23:05] RECOVERY - kvm ssl cert on labtestvirt2002 is OK: Cert 
/etc/ssl/localcerts/labvirt-star.codfw.wmnet.crt will not expire for at least 90 days [23:23:15] RECOVERY - Disk space on labtestvirt2002 is OK: DISK OK [23:23:25] RECOVERY - DPKG on labtestmetal2001 is OK: All packages OK [23:25:34] Operations, Ops-Access-Requests: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3127143 (Pnorman) [23:26:23] thcipriani: I'm gonna try cherry-picking some Wikibase changes that I also wanted to get into the SWAT, mind if I do that (and deploy them myself) while the SWAT window is still open? [23:26:47] !log Removing xhgui.results entries from before 1 June 2016 (T161196) [23:26:49] RoanKattouw: I don't mind, go for it [23:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:54] Cool thanks [23:26:54] T161196: tungsten is out of space on /srv - https://phabricator.wikimedia.org/T161196 [23:29:46] Operations, Ops-Access-Requests, Performance-Team, Patch-For-Review: Grant perf-roots access to tungsten - https://phabricator.wikimedia.org/T161261#3126819 (Dzahn) Did tungsten have a different role before? Was that role::webperf? Because that still includes the perf-roots admin group but itsel... 
[23:30:35] RECOVERY - MPT RAID on labtestvirt2002 is OK: OK [23:34:15] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [23:37:04] (PS1) Dzahn: librenms/smokeping/torrus: add Bacula backup sets [puppet] - https://gerrit.wikimedia.org/r/344560 (https://phabricator.wikimedia.org/T125020) [23:38:20] (PS1) Dzahn: bacula: add /srv/librenms to librenms backup file set [puppet] - https://gerrit.wikimedia.org/r/344561 (https://phabricator.wikimedia.org/T125020) [23:44:21] (CR) Dzahn: [C: 2] bacula: add /srv/librenms to librenms backup file set [puppet] - https://gerrit.wikimedia.org/r/344561 (https://phabricator.wikimedia.org/T125020) (owner: Dzahn) [23:44:54] (CR) Dzahn: [C: 2] librenms/smokeping/torrus: add Bacula backup sets [puppet] - https://gerrit.wikimedia.org/r/344560 (https://phabricator.wikimedia.org/T125020) (owner: Dzahn) [23:44:55] PROBLEM - DPKG on labtestvirt2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:54:25] PROBLEM - bacula director process on helium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-dir [23:55:55] RECOVERY - DPKG on labtestvirt2002 is OK: All packages OK [23:56:25] RECOVERY - nova-compute process on labtestvirt2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute