[00:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160721T0000).
[00:03:57] (PS1) Dzahn: admin: add shell account for Jasmeet Samra [puppet] - https://gerrit.wikimedia.org/r/300196 (https://phabricator.wikimedia.org/T140445)
[00:05:51] (PS2) BryanDavis: logstash: Parse nginx access logs for wdqs [puppet] - https://gerrit.wikimedia.org/r/299825
[00:11:57] (CR) BryanDavis: "I have cherry-picked the patch to deployment-puppetmaster (and fixed a syntax error from PS1)." [puppet] - https://gerrit.wikimedia.org/r/299825 (owner: BryanDavis)
[00:48:06] (CR) Jforrester: [C: -1] Initialize configuration for tcy.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[00:49:15] (CR) Dereckson: "Could you follow 9483358b3f80d85c2e5be1515a265a5b512f132f for commit message format?" [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[00:51:16] (CR) Dereckson: [C: -1] Initialize configuration for tcy.wikipedia (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[00:52:28] (CR) Dereckson: Initialize configuration for tcy.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[01:06:27] (PS3) Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898)
[01:09:57] (CR) Paladox: Initial configuration for tcy.wikipedia (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[01:10:03] (PS4) Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898)
[01:10:10] (CR) Dereckson: Initial configuration for tcy.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[01:11:19] (CR) Paladox: Initial configuration for tcy.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[01:12:38] (CR) Dereckson: Initial configuration for tcy.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[01:12:48] (CR) Dereckson: Initial configuration for tcy.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[01:13:48] (PS5) Paladox: Initial
configuration for tcy.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898)
[01:17:35] Operations, ops-codfw, netops: audit network ports in a4-codfw - https://phabricator.wikimedia.org/T140935#2481487 (faidon) @RobH, try `show lldp neighbors` (with or without `| match ge-4` at the end).
[01:26:24] PROBLEM - MD RAID on mw1259 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:26:25] PROBLEM - SSH on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:28:14] RECOVERY - MD RAID on mw1259 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[01:28:16] RECOVERY - SSH on mw1259 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0)
[01:35:31] (PS3) Chad: WIP: Gerrit: Greatly simplify directory management on host [puppet] - https://gerrit.wikimedia.org/r/300048
[01:40:14] (PS4) Chad: Gerrit: Greatly simplify directory management on host [puppet] - https://gerrit.wikimedia.org/r/300048
[01:40:38] (CR) Chad: [C: +1] "Yay https://puppet-compiler.wmflabs.org/3420/" [puppet] - https://gerrit.wikimedia.org/r/300048 (owner: Chad)
[01:41:22] (CR) jenkins-bot: [V: -1] Gerrit: Greatly simplify directory management on host [puppet] - https://gerrit.wikimedia.org/r/300048 (owner: Chad)
[02:23:45] (CR) Krinkle: [C: -1] "404 Not Found /static/images/project-logos/tcywiki.png." [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[02:25:25] (CR) Krinkle: "Please download a correctly sized rendering of the SVG logo in both 1x and 2x size, run through an optimiser (e.g. zopflipng, or ImageOpti" [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[02:30:58] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.10) (duration: 09m 33s)
[02:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:53:39] Operations, Commons, MediaWiki-Page-deletion, media-storage, and 4 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2482696 (aaron) Open>Resolved According to [[ https://logstash.wikimedia....
[02:56:19] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.11) (duration: 08m 57s)
[02:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:03:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jul 21 03:03:21 UTC 2016 (duration 7m 2s)
[03:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:11:25] (CR) Dzahn: [C: +1] Add Bryan to labtest roots. [puppet] - https://gerrit.wikimedia.org/r/299959 (https://phabricator.wikimedia.org/T140830) (owner: Gehel)
[03:13:49] Operations, Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2441876 (Dzahn) 2 of 3 users are good to go now. We just need a wikitech user for "bcohn" to finalize this.
[03:29:39] Operations, ops-eqiad, fundraising-tech-ops: decommission aluminium, replace it with frqueue1002 - https://phabricator.wikimedia.org/T140676#2472440 (Dzahn) already removed from DNS in July 2015 and don't see anything in puppet either. --- commit 4c46ff39f1071816d8ed865d93d66daf3b3fc929 Author: jgr...
[03:31:16] Operations, ops-eqiad, fundraising-tech-ops: decommission aluminium, replace it with frqueue1002 - https://phabricator.wikimedia.org/T140676#2482722 (Dzahn) only mgmt dns is left, since cables have been removed.. we can remove that too
[03:32:48] Operations, ops-eqiad, fundraising-tech-ops: decommission aluminium, replace it with frqueue1002 - https://phabricator.wikimedia.org/T140676#2482723 (Dzahn) oh wait, you mean "aluminium.**frack.**eqiad.wmnet" (too) right
[03:38:41] /wmf/dns$ git rebase --continue
[03:38:41] fatal: update_ref failed for ref 'refs/heads/master': cannot lock ref 'refs/heads/master': ref refs/heads/master is at afda28cb6ce31bf058b662cc352fef91029ab921 but expected b49609114be919f8129ac6c464dfcfdbc56c61f3
[03:38:45] Successfully rebased and updated refs/heads/master.
[03:38:50] fatal AND successful.. yay
[03:39:21] welcome to git!
[03:39:24] http://latkin.org/blog/2016/07/20/git-for-windows-accidentally-creates-ntfs-alternate-data-streams/
[03:42:35] MaxSem: lolwut
[03:44:05] MaxSem: correction: lol:wut :)
[03:47:45] (PS1) Dzahn: remove all aluminum/aluminium remnants [dns] - https://gerrit.wikimedia.org/r/300213 (https://phabricator.wikimedia.org/T140676)
[03:48:17] (PS2) Dzahn: remove all aluminum/aluminium remnants [dns] - https://gerrit.wikimedia.org/r/300213 (https://phabricator.wikimedia.org/T140676)
[03:49:51] wanted to add Jeff as reviewer in gerrit, typing J.., waiting for autocomplete, hit enter, but i got "JavaScript" instead.. that adds like 20 unrelated people at once .. oops :)
[04:02:07] Operations, Discovery, Labs, Labs-Infrastructure, and 2 others: Update coastline data in OSM postgres db (osmdb.eqiad.wmnet) - https://phabricator.wikimedia.org/T140296#2459518 (Dzahn) just some technical notes: osmdb.eqiad.wmnet is an alias for labsdb1006.eqiad.wmnet cheat sheet for shp2pgs...
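An aside on the contradictory "fatal AND successful" rebase output above: git finalizes a rebase with a compare-and-swap `update_ref`, which refuses to move a branch whose ref no longer points at the commit it expected (e.g. because another process touched it mid-rebase). A minimal sketch of that behaviour in a throwaway repo (the repo, branch name, and commit messages are made up for illustration; this is not the /wmf/dns checkout):

```shell
# Demonstrate git's compare-and-swap ref update in a scratch repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.org commit -q --allow-empty -m first
old=$(git rev-parse HEAD)
git -c user.name=demo -c user.email=demo@example.org commit -q --allow-empty -m second
new=$(git rev-parse HEAD)
git update-ref refs/heads/demo "$old"            # create the ref at $old
git update-ref refs/heads/demo "$new" "$old"     # CAS: ref is at $old as expected, so it moves
# This CAS fails: the ref is now at $new, not the stale $old we claim to expect --
# the same "is at X but expected Y" failure seen in the log.
git update-ref refs/heads/demo "$old" "$old" 2>/dev/null || echo "stale expected value rejected"
```

The third form of `git update-ref <ref> <newvalue> <oldvalue>` only succeeds if the ref currently equals `<oldvalue>`, which is exactly the check that produced the "cannot lock ref" message above.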
[04:19:42] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2482774 (Dzahn) copying verbatim comment from @Glaisher on T134017#2253719 --- Could someone provide the translations for the namespace names? If po...
[04:31:42] (PS1) Dzahn: restbase: add new tcy.wikipedia [puppet] - https://gerrit.wikimedia.org/r/300214 (https://phabricator.wikimedia.org/T140898)
[04:35:17] (PS1) Dzahn: labs dnsrecursor: add tcy.wiki(pedia) [puppet] - https://gerrit.wikimedia.org/r/300215 (https://phabricator.wikimedia.org/T140898)
[04:36:16] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2482807 (Dzahn)
[05:52:02] (today I'll be afk :)
[06:07:28] (PS2) Dzahn: admin: add shell account for Jasmeet Samra [puppet] - https://gerrit.wikimedia.org/r/300196 (https://phabricator.wikimedia.org/T140445)
[06:18:25] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2482825 (Dzahn) @AlexMonk-WMF is an Interwiki cache update like https://gerrit.wikimedia.org/r/#/c/286552/1 needed for this as well?
[06:30:34] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:34] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: puppet fail
[06:30:34] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:34] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:32] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:33] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:38] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2482840 (Dzahn) @Aude could we have a change like https://gerrit.wikimedia.org/r/#/c/288097/4 for "tcy"?
[06:31:43] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:45] Operations, Puppet, Labs, Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2441209 (greg) rOPUP:modules/toollabs/manifests/dev_environ.pp already has differences for what is installed and not just version, but software themselves (eg...
[06:34:18] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2482844 (Dzahn) also needed: - messages (https://gerrit.wikimedia.org/r/#/c/286556/) - database replica labs (DBA)
[06:34:43] (PS1) ArielGlenn: fix up xmlstubs batch jobs setting for en wiki xml dumps [puppet] - https://gerrit.wikimedia.org/r/300224 (https://phabricator.wikimedia.org/T132279)
[06:36:57] (CR) ArielGlenn: [C: +2] fix up xmlstubs batch jobs setting for en wiki xml dumps [puppet] - https://gerrit.wikimedia.org/r/300224 (https://phabricator.wikimedia.org/T132279) (owner: ArielGlenn)
[06:42:53] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:50:29] _joe_: I'm seeing error: RPC failed; result=22, HTTP code = 503
[06:50:29] fatal: The remote end hung up unexpectedly
[06:50:29] for both strontium and rhodium on puppet-merge from palladium
[06:50:32] any ideas?
[06:50:50] I get the same hangup when I try puppet-merge from strontium, takes quite a while to fail in both cases
[06:51:15] <_joe_> apergos: no idea, that's clearly not related to my past work on puppet
[06:51:32] <_joe_> seems like gerrit issues tbh
[06:51:52] * apergos grumbles some
[06:51:59] <_joe_> I would look at what git does in puppet-merge
[06:52:09] <_joe_> then run it with at least GIT_TRACE=1
[06:52:23] <_joe_> sorry, gotta go run an errand in 2 minutes
[06:52:43] see ya
[06:53:57] <_joe_> apergos: yeah it seems it's gerrit
[06:55:53] PROBLEM - Unmerged changes on repository puppet on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:56:03] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:56:03] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
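A side note on _joe_'s debugging suggestion above: `GIT_TRACE=1` makes git print each built-in and sub-process as it runs, which narrows a mid-fetch 503 down to the transport phase that dies; `GIT_CURL_VERBOSE=1` additionally dumps the HTTP exchange for smart-HTTP remotes. A sketch (the fetch URL below is a placeholder, not the real gerrit remote used by puppet-merge):

```shell
# Harmless local demonstration: trace lines go to stderr.
GIT_TRACE=1 git version 2>&1 | grep "trace:"

# Against a failing remote, the equivalent of the fetch puppet-merge does
# (placeholder URL -- substitute the actual remote):
# GIT_TRACE=1 GIT_CURL_VERBOSE=1 git fetch https://gerrit.example.org/r/operations/puppet production
```

With the trace enabled, a server-side failure like the one in the log shows up as the `http-fetch`/pack negotiation step aborting, which is what pointed at gerrit rather than the puppetmasters here.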
[06:56:04] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:57:03] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:13] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:53] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:04:12] PROBLEM - Unmerged changes on repository puppet on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:04:13] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:04:22] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:05:43] apergos: getting the same in beta code updates: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/113762/console
[07:05:50] nice
[07:05:52] see 6:58
[07:06:13] I was just about to report a bug, but I should sleep (it's 00:06 here), can you?
[07:07:16] still pulling from gerrit right?
[07:07:28] not going to bug report it, going to try to kick it somehow and fix the issue
[07:07:38] we can't have no puppet changes going in today, that's no good
[07:07:43] go sleep, greg-g
[07:08:03] touché, see, I'm sleepy
[07:08:06] thanks
[07:33:27] Hi Operations. Can someone explain that task to me, and why it is important for a general audience to know about it? https://phabricator.wikimedia.org/T86096
[07:38:05] (CR) KartikMistry: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/300193 (owner: MaxSem)
[07:42:23] <_joe_> Trizek: I guess you mean why it was tagged "user-notice"?
[07:42:49] Yes _joe_.
[07:43:10] <_joe_> when we changed the version of the ICU library hhvm is linked against, that changed the way some pages were rendered until we ran a script
[07:43:15] I need to understand what it is about to see how to include it in Tech News.
[07:43:16] <_joe_> so users would notice the issue
[07:43:30] <_joe_> Trizek: it happened 2 months ago or so?
[07:43:38] You are speaking Klingon to me, I'm afraid :)
[07:44:30] <_joe_> Trizek: so, one typical effect was https://phabricator.wikimedia.org/T136281
[07:44:41] <_joe_> we upgraded the application, and then had to run a script
[07:45:03] <_joe_> until that script finished running, some issues were observable on the wikis
[07:45:55] <_joe_> but again, all of that finished in May
[07:46:03] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[07:46:03] So this is now fixed?
[07:46:05] <_joe_> it was two months ago
[07:46:07] <_joe_> Trizek: yes
[07:46:44] So basically, it doesn't need to be announced.
[07:46:47] <_joe_> isn't the ticket resolved?
[07:46:51] <_joe_> yes, no need
[07:46:52] RECOVERY - Unmerged changes on repository puppet on labcontrol1001 is OK: No changes to merge.
[07:47:01] !log restarted gerrit on ytterbium, it was refusing to complete git fetches for large repos (mw core, puppet...)
[07:47:03] RECOVERY - Unmerged changes on repository puppet on labcontrol1002 is OK: No changes to merge.
[07:47:03] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[07:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:47:19] Operations, HHVM, Patch-For-Review: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2482909 (Trizek-WMF)
[07:47:38] Thanks a lot for your explanations, _joe_!
[07:48:02] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge.
[07:48:34] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:51:40] <_joe_> Trizek: you're welcome :)
[07:54:42] <_joe_> I might break puppet in a bit
[07:54:50] <_joe_> as in breaking the puppetmaster
[07:55:02] <_joe_> uhm grrrt-wm is off as well
[07:55:05] <_joe_> let's kick it
[07:58:34] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 31 probes of 399 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[07:58:45] (CR) Mobrovac: [C: +1] restbase: add new tcy.wikipedia [puppet] - https://gerrit.wikimedia.org/r/300214 (https://phabricator.wikimedia.org/T140898) (owner: Dzahn)
[08:03:05] (CR) Giuseppe Lavagetto: [C: +2] puppetmaster: declare NameVirtualHost where expected [puppet] - https://gerrit.wikimedia.org/r/299752 (owner: Giuseppe Lavagetto)
[08:04:04] <_joe_> some puppet failures will be inevitable
[08:04:33] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 399 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[08:10:33] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch, Wikimedia-Logstash: logstash - cron failing to optimize indices - https://phabricator.wikimedia.org/T140973#2482928 (Gehel)
[08:10:39] <_joe_> !log restarting apache on palladium
[08:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:11:53] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:14:03] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:14:52] PROBLEM - puppet last run on mw2112 is CRITICAL: CRITICAL: puppet fail
[08:15:02] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: puppet fail
[08:15:03] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: puppet fail
[08:15:03] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 7 failures
[08:15:13] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: puppet fail
[08:15:13] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: puppet fail
[08:15:23] PROBLEM - puppet last run on mw1159 is CRITICAL: CRITICAL: Puppet has 6 failures
[08:15:23] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Puppet has 17 failures
[08:15:32] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 6 failures
[08:15:33] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 3 failures
[08:15:33] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: puppet fail
[08:15:33] PROBLEM - puppet last run on mw2194 is CRITICAL: CRITICAL: Puppet has 10 failures
[08:15:34] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: puppet fail
[08:15:42] PROBLEM - puppet last run on mc1009 is CRITICAL: CRITICAL: puppet fail
[08:15:43] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: puppet fail
[08:15:43] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: puppet fail
[08:15:52] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Puppet has 9 failures
[08:15:54] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: Puppet has 6 failures
[08:16:03] PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: Puppet has 8 failures
[08:16:12] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: puppet fail
[08:16:12] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 35 failures
[08:16:12] PROBLEM - puppet last run on es2003 is CRITICAL: CRITICAL: puppet fail
[08:16:13] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: puppet fail
[08:16:13] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: puppet fail
[08:16:22] PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: Puppet has 10 failures
[08:16:22] PROBLEM - puppet last run on mw1280 is CRITICAL: CRITICAL: puppet fail
[08:16:23] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: Puppet has 7 failures
[08:16:23] PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: Puppet has 9 failures
[08:16:32] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: puppet fail
[08:16:32] PROBLEM - puppet last run on mw1283 is CRITICAL: CRITICAL: Puppet has 11 failures
[08:16:33] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: puppet fail
[08:16:42] PROBLEM - puppet last run on mw2065 is CRITICAL: CRITICAL: Puppet has 4 failures
[08:16:44] PROBLEM - puppet last run on mw2189 is CRITICAL: CRITICAL: Puppet has 6 failures
[08:16:44] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Puppet has 6 failures
[08:16:52] PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: Puppet has 13 failures
[08:17:02] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 9 failures
[08:17:13] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: Puppet has 11 failures
[08:17:15] <_joe_> expected, I restarted apache on palladium
[08:17:33] PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: Puppet has 10 failures
[08:17:33] PROBLEM - puppet last run on mw2197 is CRITICAL: CRITICAL: Puppet has 9 failures
[08:17:43] PROBLEM - puppet last run on mw2103 is CRITICAL: CRITICAL: Puppet has 6 failures
[08:17:44] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 6 failures
[08:21:45] Operations, ops-eqiad, DBA: dbstore1002 disk errors - https://phabricator.wikimedia.org/T140337#2482951 (jcrespo) I had that very same problem with the old disk, but I assumed it was because it had failed. :-( Let me see if I see anything else bad.
[08:24:52] Operations, Beta-Cluster-Infrastructure, Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#2482954 (hashar) There is still role::parsoid::beta left over. We probably want to audit what is left in puppet.git but afaik there is nothing left to do.
[08:37:43] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[08:40:25] Blocked-on-Operations, Puppet, Beta-Cluster-Infrastructure, Labs, Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2482979 (hashar) I haven't seen that occurr...
[08:40:53] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[08:41:13] RECOVERY - puppet last run on mw1159 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[08:41:22] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[08:41:32] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[08:41:32] RECOVERY - puppet last run on mw2194 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:41:33] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:41:33] RECOVERY - puppet last run on mc1009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[08:41:42] RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[08:41:43] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[08:41:43] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[08:41:53] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[08:42:02] RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:03] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[08:42:03] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[08:42:12] RECOVERY - puppet last run on es2003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[08:42:13] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:13] RECOVERY - puppet last run on mw1280 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[08:42:22] RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:42:22] RECOVERY - puppet last run on mw2179 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[08:42:23] RECOVERY - puppet last run on mw1283 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[08:42:23] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:32] RECOVERY - puppet last run on rdb2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:32] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:42] RECOVERY - puppet last run on mw2065 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[08:42:52] RECOVERY - puppet last run on mw2189 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[08:42:52] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:54] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:54] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:54] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:03] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:03] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:04] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:12] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:33] RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:33] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[08:43:33] RECOVERY - puppet last run on mw2197 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:33] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[08:43:42] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:43] RECOVERY - puppet last run on mw2103 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:44:03] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[08:44:14] ACKNOWLEDGEMENT - HP RAID on ms-be1027 is CRITICAL: CRITICAL: Slot 3: Failed: 2I:4:1, 2I:4:2 - OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor Filippo Giunchedi waiting on replacement/diagnose, T140374
[08:44:14] ACKNOWLEDGEMENT - MD RAID on ms-be1027 is CRITICAL: CRITICAL: Active: 11, Working: 11, Failed: 1, Spare: 0 Filippo Giunchedi waiting on replacement/diagnose, T140374
[08:44:43] RECOVERY - puppet last run on mw2112 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:47:28] Operations: reinstall snapshot1001.eqiad.wmnet with RAID, decomm snapshot1002,3,4 - https://phabricator.wikimedia.org/T140439#2483012 (ArielGlenn) a: ArielGlenn
[08:47:50] Operations: reinstall snapshot1001.eqiad.wmnet with RAID, decomm snapshot1002,3,4 - https://phabricator.wikimedia.org/T140439#2464872 (ArielGlenn)
[08:48:43] Operations, Ops-Access-Requests: Requesting access to text caches for andyrussg - https://phabricator.wikimedia.org/T140958#2483032 (Gehel) p: Triage>Normal @BBlack, @ema: varnish is your domain, any opinion on this request for access? It seems that currently access to cp* servers is fairly restri...
[08:50:44] RECOVERY - Disk space on ms-be3004 is OK: DISK OK
[08:52:48] Operations, Monitoring, Release-Engineering-Team: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2483038 (Gehel) p: Triage>Low Triaging this as low priority to match T117470.
[08:54:27] (CR) Nikerabbit: [C: +1] Labs: remove wmgUseContentTranslation - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300193 (owner: MaxSem)
[08:54:58] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch, Wikimedia-Logstash: logstash - cron failing to optimize indices - https://phabricator.wikimedia.org/T140973#2483042 (Gehel) p: Triage>Normal
[08:55:35] (CR) Filippo Giunchedi: [C: +1] Disable `streaming_socket_timeout_in_ms` setting [puppet] - https://gerrit.wikimedia.org/r/300059 (https://phabricator.wikimedia.org/T134016) (owner: Eevans)
[09:00:48] (CR) Filippo Giunchedi: [C: +1] Logstash_checker script for canary deploys [puppet] - https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: GWicke)
[09:01:23] (PS1) Giuseppe Lavagetto: puppetmaster: fix test vhost proxy auth [puppet] - https://gerrit.wikimedia.org/r/300234
[09:04:15] (CR) Giuseppe Lavagetto: [C: +2] puppetmaster: fix test vhost proxy auth [puppet] - https://gerrit.wikimedia.org/r/300234 (owner: Giuseppe Lavagetto)
[09:04:20] Operations, Labs, Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2483051 (fgiunchedi) serpens still shows some memory growth, possibly not fixed yet {F4293977}
[09:05:23] PROBLEM - puppet last run on wtp1002 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago
[09:07:22] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:08:42] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[09:12:37] Operations, ops-codfw, media-storage: ms-be2017 failed disk - https://phabricator.wikimedia.org/T140948#2483058 (fgiunchedi) Open>Invalid I'm not seeing the errors reported in icinga for ms-be2027, I think this was ms-be1027 i.e. {T140374}
[09:13:14] PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:13:27] (CR) MarcoAurelio: [C: -1] "If Visual Editor is to be enabled there, then the wiki should be added to dblists/visualeditordefault.dblist I think." [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[09:14:03] PROBLEM - puppet last run on mw2190 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:16:19] (PS3) Gehel: Configure new relevance forge servers [puppet] - https://gerrit.wikimedia.org/r/299865 (https://phabricator.wikimedia.org/T137256)
[09:19:13] (CR) Gehel: [C: +2] "Reviewed with Erik, LVS will come as a second step. Looks good otherwise." [puppet] - https://gerrit.wikimedia.org/r/299865 (https://phabricator.wikimedia.org/T137256) (owner: Gehel)
[09:24:33] PROBLEM - puppetmaster backend https on rhodium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 403 Forbidden
[09:25:53] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: puppet fail
[09:26:42] RECOVERY - puppetmaster backend https on rhodium is OK: HTTP OK: Status line output matched 400 - 330 bytes in 0.057 second response time
[09:27:04] ^relforge is me... checking ...
[09:31:36] (03PS1) 10Filippo Giunchedi: add thumbor service IPs [dns] - 10https://gerrit.wikimedia.org/r/300240 (https://phabricator.wikimedia.org/T139606) [09:33:03] PROBLEM - puppetmaster backend https on rhodium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 403 Forbidden [09:36:14] (03PS1) 10Gehel: Adding rack information for new relforge servers [puppet] - 10https://gerrit.wikimedia.org/r/300241 (https://phabricator.wikimedia.org/T137256) [09:38:33] RECOVERY - puppet last run on mw2190 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:39:44] RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:39:48] 06Operations, 10Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#2483113 (10ArielGlenn) How does http://dumps.wikimedia.your.org/ perform? I can ask them about their routing but I know all requests come to and are served from a h... [09:42:29] (03CR) 10Gehel: [C: 032] Adding rack information for new relforge servers [puppet] - 10https://gerrit.wikimedia.org/r/300241 (https://phabricator.wikimedia.org/T137256) (owner: 10Gehel) [09:44:10] 06Operations, 10MediaWiki-General-or-Unknown: 503 error raises again while trying to load a Wikidata page - https://phabricator.wikimedia.org/T140879#2483121 (10abian) Today, this 503 error raises again with the corresponding URL (different diff and different oldid, but the same page)... https://www.wikidata.... 
[09:47:51] !log reinstalling and configuring relforge1001/1002 - T137256 [09:47:52] T137256: Setup two node elasticsearch cluster on relforge1001-1002 - https://phabricator.wikimedia.org/T137256 [09:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:49:38] (03PS3) 10Addshore: RevisionSlider enables: dewiki, hewiki, arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298933 (https://phabricator.wikimedia.org/T140232) [09:51:15] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2483131 (10jcrespo) >>! In T140898#2482844, @Dzahn wrote: > also needed: > > - messages (https://gerrit.wikimedia.org/r/#/c/286556/) > > - database re... [09:53:18] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2483135 (10jcrespo) I cannot do the first until the database is created. The second depend on this. [09:55:02] (03PS1) 10Giuseppe Lavagetto: puppetmaster: Apache 2.4/jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300242 [10:16:30] (03PS1) 10Filippo Giunchedi: lvs: add thumbor to lvs [puppet] - 10https://gerrit.wikimedia.org/r/300244 (https://phabricator.wikimedia.org/T139606) [10:23:10] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM for now, but please see my comment." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300244 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [10:24:03] (03PS2) 10Giuseppe Lavagetto: puppetmaster: Apache 2.4/jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300242 [10:29:16] (03PS1) 10ArielGlenn: fix link to current set of cirrus search dumps [puppet] - 10https://gerrit.wikimedia.org/r/300246 (https://phabricator.wikimedia.org/T138176) [10:35:07] !log cr2-eqiad: increase cross-datacenter link OSPF metrics [10:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:37:27] (03PS3) 10Giuseppe Lavagetto: puppetmaster: Apache 2.4/jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300242 [10:38:25] (03PS2) 10ArielGlenn: fix link to current set of cirrus search dumps [puppet] - 10https://gerrit.wikimedia.org/r/300246 (https://phabricator.wikimedia.org/T138176) [10:40:50] (03CR) 10ArielGlenn: [C: 032] fix link to current set of cirrus search dumps [puppet] - 10https://gerrit.wikimedia.org/r/300246 (https://phabricator.wikimedia.org/T138176) (owner: 10ArielGlenn) [10:50:43] !log cr2-eqiad: deactivating IX BGP sessions [10:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:52:50] !log cr2-eqiad: deactivating Transit BGP sessions [10:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:53:51] (03PS4) 10Giuseppe Lavagetto: puppetmaster: Apache 2.4/jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300242 [10:55:12] !log cr2-eqiad: deactivating Fundraising BGP session [10:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:59:54] !log cr2-eqiad: disabling IX/Transit/Fundraising interfaces [10:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:00:04] paravoid: Dear anthropoid, the time has come. 
Please deploy network maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160721T1100). [11:00:38] PROBLEM - Host lutetium is DOWN: PING CRITICAL - Packet loss = 100% [11:01:08] <_joe_> I guess this is expected paravoid [11:01:09] is that expected? [11:03:06] it's not :/ [11:03:15] just one frack host? [11:03:20] <_joe_> seems so [11:03:45] mismatch of some acl or so? [11:03:49] otoh, acls are mostly on the SRX [11:03:57] I can ping it from neon, weird [11:05:39] ah, it's its public IP, 208.80.155.13 [11:05:45] not lutetium.frack.eqiad.wmnet (10.64.40.111) [11:05:46] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 106, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-15/0/0: down - cr2-eqiad:xe-5/0/3BR [11:10:03] I don't see it [11:10:11] all looks good really [11:10:34] (03PS5) 10Giuseppe Lavagetto: puppetmaster: Apache 2.4/jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300242 [11:11:00] i can't even login on pfw1 [11:11:14] probably some stupid acl [11:12:08] pretty sure it's just some NAT stupidity on the SRX [11:12:15] probably [11:13:22] lutetium sees the packet and replies [11:13:26] 11:13:23.946346 IP 10.64.40.111 > 208.80.154.14: ICMP echo reply, id 22481, seq 15, length 64 [11:14:23] all the rest works [11:14:31] I'll proceed with the cr2-eqiad window [11:14:35] ok [11:14:55] (03PS2) 10Giuseppe Lavagetto: Change-Prop: Fix error ignoring config bug [puppet] - 10https://gerrit.wikimedia.org/r/300166 (owner: 10Ppchelko) [11:15:10] !log cr2-eqiad: deactivate chassis redundancy graceful-switchover [11:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:15:41] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Change-Prop: Fix error ignoring config bug [puppet] - 10https://gerrit.wikimedia.org/r/300166 (owner: 10Ppchelko) [11:15:56] (03PS2) 10Giuseppe Lavagetto: Change-prop: Ignore bot edits on ORES precache updates. 
[puppet] - 10https://gerrit.wikimedia.org/r/300108 (owner: 10Ppchelko) [11:16:06] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Change-prop: Ignore bot edits on ORES precache updates. [puppet] - 10https://gerrit.wikimedia.org/r/300108 (owner: 10Ppchelko) [11:17:10] <_joe_> mobrovac: running puppet on scb* [11:17:26] kk, i'll restart afterwards [11:17:41] <_joe_> puppet has run [11:24:43] !log upgrading cr2-eqiad:re0 and rebooting [11:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:27:16] 06Operations, 10Monitoring, 13Patch-For-Review: diamond: certain counters always calculated as 0 - https://phabricator.wikimedia.org/T138758#2483193 (10ema) @elukey : that's right, we're simply sending gauges instead of counters but the behavior of `Collector.derivative()` still needs to be investigated. [11:28:05] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.4.13:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.4.13, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [11:28:16] PROBLEM - configured eth on relforge1001 is CRITICAL: Connection refused by host [11:28:16] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.37.21:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.21, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [11:28:17] PROBLEM - Elasticsearch HTTPS on relforge1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [11:28:25] PROBLEM - MD RAID on relforge1001 is CRITICAL: Connection refused by host [11:28:25] PROBLEM - salt-minion processes on relforge1001 is CRITICAL: Connection refused by host [11:28:37] PROBLEM - Elasticsearch HTTPS on
relforge1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [11:28:45] PROBLEM - NTP on relforge1001 is CRITICAL: NTP CRITICAL: No response from NTP server [11:28:56] PROBLEM - dhclient process on relforge1001 is CRITICAL: Connection refused by host [11:29:06] PROBLEM - Check size of conntrack table on relforge1001 is CRITICAL: Connection refused by host [11:29:27] relforge? [11:29:36] PROBLEM - Disk space on relforge1001 is CRITICAL: Connection refused by host [11:29:45] PROBLEM - DPKG on relforge1001 is CRITICAL: Connection refused by host [11:32:59] looks like a new host, silenced [11:33:56] (03PS1) 10Ema: cache_upload: do not set Access-Control-Allow-Origin twice [puppet] - 10https://gerrit.wikimedia.org/r/300249 [11:34:07] (03CR) 10Mobrovac: [C: 031] Labs: remove wmgUseEventBus - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300192 (owner: 10MaxSem) [11:38:27] !log cr2-eqiad: toggling mastership between routing-engines (re1->re0) [11:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:40:46] PROBLEM - Host cr2-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.197) [11:43:05] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 106, down: 0, dormant: 0, excluded: 1, unused: 0 [11:43:57] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [11:44:16] RECOVERY - Host cr2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [11:44:31] 06Operations, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#2483220 (10mobrovac) https://gerrit.wikimedia.org/r/#/c/300067/ addresses this. Will amend the commit to link it to this bug too. 
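ema's note earlier in this window about `Collector.derivative()` (T138758, "certain counters always calculated as 0") points at a classic counter-handling pitfall: a derivative needs a previous sample, so the first reported value is always 0. The sketch below is an illustrative reimplementation of that behavior, not diamond's actual code:

```python
# Illustrative sketch of derivative() semantics for counter metrics
# (not diamond's actual implementation): the first sample has no
# baseline, so the computed rate is 0 -- which is one way a counter
# can end up "always calculated as 0" if state is lost between runs.

class DerivativeTracker:
    def __init__(self):
        self.last_values = {}

    def derivative(self, name, new_value, interval=1):
        """Return the per-interval rate of change for a counter."""
        old_value = self.last_values.get(name)
        self.last_values[name] = new_value
        if old_value is None:
            # No previous reading yet: first sample always yields 0.
            return 0
        return (new_value - old_value) / float(interval)

tracker = DerivativeTracker()
print(tracker.derivative("requests", 100))   # first sample -> 0
print(tracker.derivative("requests", 160))   # (160 - 100) / 1 -> 60.0
```

Sending gauges instead of counters, as the comment describes, sidesteps this logic entirely because the raw value is shipped without differencing.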
[11:45:09] (03PS2) 10Mobrovac: Parsoid: clean up the manifests and files [puppet] - 10https://gerrit.wikimedia.org/r/300067 (https://phabricator.wikimedia.org/T90668) [11:45:15] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [11:49:16] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 106, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-15/0/0: down - cr2-eqiad:xe-5/0/3BR [11:49:16] PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 1 yellow alarms [11:49:33] !log upgrading cr2-eqiad:re1 and rebooting [11:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:51:35] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 97 probes of 243 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:53:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:54:17] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:57:36] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 15 probes of 243 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:59:07] 07Blocked-on-Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs, 13Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2206880 (10AlexMonk-WMF) I saw it just a few... [12:03:27] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480071 (10AlexMonk-WMF) >>! In T140898#2482825, @Dzahn wrote: > @AlexMonk-WMF is an Interwiki cache update like https://gerrit.wikimedia.org/r/#/c/2865... 
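The cr2-eqiad drain steps being !logged through this maintenance window (raising cross-datacenter OSPF metrics, deactivating IX/Transit/Fundraising BGP sessions, disabling edge interfaces, toggling routing-engine mastership) map onto standard Junos commands. The sketch below is illustrative only: the group names, interface names, and metric value are hypothetical, not the actual production configuration:

```
## Configuration mode -- names and values are illustrative:
set protocols ospf area 0.0.0.0 interface ae0.0 metric 1000   # de-prefer cross-DC link
deactivate protocols bgp group IX                             # drop peering sessions
deactivate protocols bgp group Transit
set interfaces xe-5/0/3 disable                               # take edge link down
commit comment "drain cr2-eqiad for maintenance"

## Operational mode, once traffic has converged onto cr1-eqiad:
request chassis routing-engine master switch                  # toggle re0 <-> re1 mastership
```

Deactivated statements stay in the configuration, so re-enabling after the maintenance is a matter of `activate` plus `commit` rather than re-entering the sessions.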
[12:07:09] PROBLEM - puppet last run on mw2216 is CRITICAL: CRITICAL: puppet fail [12:07:58] logstash/kibana is not loading [12:08:48] RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [12:08:57] Oops! Looks like something went wrong. Refreshing may do the trick. [12:08:59] but it doesn't [12:10:10] OK, now it does [12:14:09] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 15User-mobrovac: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2483251 (10mobrovac) [12:15:19] !log cr2-eqiad: setting "chassis network-services enhanced-ip" and rebooting re1 (then re0 will follow) [12:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:19:12] !log cr2-eqiad: toggling mastership between routing-engines (re0->re1) [12:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:21:47] PROBLEM - Host cr2-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.197) [12:23:58] RECOVERY - Host cr2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.61 ms [12:24:33] PROBLEM - Host text-lb.eqiad.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ed1a::1 [12:24:47] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [12:24:53] ugh [12:25:27] just heavy packet loss on IPv6 [12:25:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [12:26:06] RECOVERY - Host text-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 1.97 ms [12:26:47] Has there been an increase in frequency of "readonly" states on Wikimedia sites lately? [12:26:59] no [12:27:19] !log cr2-eqiad: rebooting backup RE (re0) [12:27:23] I am checking what prevents people from publishing articles using Content Translation, and recently "readonly" has been very common.
[12:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:27:35] Let me show what exactly I mean by "readonly": [12:27:39] not a good time now, aharoni [12:27:48] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [12:28:14] OK :) [12:28:28] sorry, in the middle of a complicated upgrade [12:30:29] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [12:30:36] 06Operations, 10ops-eqiad, 10hardware-requests: decommission WMF3155-WMF3175 (old lsearchd) - https://phabricator.wikimedia.org/T140372#2483260 (10Cmjohnson) [12:31:18] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 65 probes of 243 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:31:36] 06Operations, 10hardware-requests: eqiad out of warranty spares to decommission - approval request - https://phabricator.wikimedia.org/T120679#2483262 (10Cmjohnson) [12:31:38] 06Operations, 10ops-eqiad, 10hardware-requests: decommission WMF3155-WMF3175 (old lsearchd) - https://phabricator.wikimedia.org/T140372#2462603 (10Cmjohnson) 05Open>03Resolved [12:32:08] aharoni: what other channels are you in that are relevant?
:P [12:32:09] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:33:10] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission broken db1058 - https://phabricator.wikimedia.org/T134360#2483265 (10Cmjohnson) 05Open>03Resolved db1058 has been removed from rack [12:34:17] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [12:35:06] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:35:06] 06Operations, 10ops-eqiad, 06DC-Ops: Decommission cp1037-1040 - https://phabricator.wikimedia.org/T83553#2483267 (10Cmjohnson) 05Open>03Resolved This was completed...all servers have been removed from racks and decommissioned. [12:35:16] RECOVERY - puppet last run on mw2216 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:36:07] !log change-prop deploying b7079fd9c [12:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:37:26] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 14 probes of 243 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:37:47] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:39:01] gehel: long time no see :) [12:39:08] can not connect to production now :( [12:39:16] zeljkof: that much? 
[12:39:29] ssh config https://github.com/zeljkofilipin/dotfiles/blob/master/.ssh/config [12:40:10] a couple of terminal outputs [12:40:11] https://phabricator.wikimedia.org/P3534 [12:40:16] https://phabricator.wikimedia.org/P3535 [12:41:09] !log citoid deployed 5134e49e [12:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:41:47] <_joe_> zeljkof: https://github.com/zeljkofilipin/dotfiles/blob/master/.ssh/config#L20 [12:41:55] zeljkof: could it be that your ssh key also needs to be updated? [12:41:59] zeljkof: https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml#L1555 [12:42:19] zeljkof: I need to run to the doctor in a few minutes... [12:42:28] <_joe_> zeljkof: ProxyCommand None [12:42:32] gehel: :) I think the key is fine now, but I will double check [12:42:33] <_joe_> with no comment afterwards [12:42:42] <_joe_> None != none IIRC [12:42:45] yup [12:42:56] _joe_: hashar said the same thing, I have copy/pasted it from docs :| [12:42:58] ssh is sensitive to cases [12:43:07] PROBLEM - Host dbproxy1002 is DOWN: PING CRITICAL - Packet loss = 100% [12:43:09] will fix and try again [12:43:17] that is 'none' [12:43:18] <_joe_> jynus: ^^ [12:43:23] <_joe_> known? [12:43:31] err wrong window sorry [12:43:43] * zeljkof is doing the needful [12:47:44] 06Operations, 10Ops-Access-Requests: Requesting access to text caches for andyrussg - https://phabricator.wikimedia.org/T140958#2483278 (10BBlack) 05Open>03declined Yes, outside of global roots, access to any of the caches is pretty tightly restricted. It's not just based on needs, but also other stabilit... 
[12:48:46] if dbproxy is down, gerrit and otrs are down among others [12:48:53] !log cr2-eqiad: fixing IPv6 VRRP interoperability between the cr1/cr2 ( http://www.juniper.net/documentation/en_US/junos14.2/topics/concept/vrrpv3-junos-support.html ) [12:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:49:08] which means there is an ongoing outage [12:49:11] !log cr2-eqiad: re-enabling GRES and toggling mastership between routing-engines (re1->re0) [12:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:53:48] hashar _joe_ gehel mobrovac: works now! [12:53:59] minor tweaks were needed in the docs https://wikitech.wikimedia.org/w/index.php?title=Production_shell_access&type=revision&diff=773056&oldid=763132 [12:54:06] thanks everybody [12:54:44] (relevant changes are in ssh config, hashar made a few text style changes too) [12:54:57] looks like inline comments in ssh config were causing trouble [12:55:10] entirely my fault [12:55:38] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [12:55:38] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [12:56:50] hashar: so you were the one that added the inline comments? :D [12:56:57] yup [12:57:09] without even testing it / reading the ssh_config doc about comments [12:58:04] !log manually flipping m2-master to db1020 [12:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:01:28] !log bounce gerrit on ytterbium [13:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:01:40] anyone else getting an error page from gerrit?
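The root cause zeljkof, hashar, and _joe_ converged on above is a general ssh_config gotcha: `#` only starts a comment at the beginning of a line, so an inline comment after a value becomes part of that value, and `ProxyCommand` arguments are case-sensitive (the keyword must be the lowercase `none`, not `None`). A minimal sketch of a correct config, with illustrative hostnames rather than the real Wikimedia bastion entries:

```
# Comments in ssh_config must be on their own line; anything after a
# value is parsed as part of that value.

Host bastion.wmflabs.example
    # Connect to the bastion directly -- lowercase "none", no trailing comment
    ProxyCommand none

Host *.eqiad.wmnet *.codfw.wmnet
    # Everything else is tunneled through the bastion
    ProxyCommand ssh -W %h:%p bastion.wmflabs.example
```

Writing `ProxyCommand None  # direct` makes ssh try to execute a proxy command literally named `None  # direct`, which produces exactly the kind of opaque connection failures pasted in P3534/P3535.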
[13:01:57] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:01:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:02:07] stephanebisson: should be gone now [13:02:22] godog: yep, thanks! [13:02:46] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:05:38] heh the gerrit bot doesn't survive gerrit outages apparently, anyways dns change is https://gerrit.wikimedia.org/r/#/c/300254/1 [13:10:00] !log cr2-eqiad: setting "chassis state cb-upgrade on" and powering off re1 (backup) [13:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:10:30] !log cr2-eqiad: setting fabric plane 4 to offline [13:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:10:57] !log cr2-eqiad: setting fabric plane 5/6/7 to offline [13:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:11:27] godog: should merge it now, IMHO [13:11:48] (or we risk blocking on it or accidentally reverting it if we need to make a quick DNS commit during network maint) [13:12:35] godog, yeah, restarting that bot [13:12:35] !log cr2-eqiad: setting scb 1 to offline and replacing it [13:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:12:42] bblack: indeed! 
merged, I'll run authdns-update too [13:13:25] (instructions for it are at https://wikitech.wikimedia.org/wiki/Grrrit-wm#Building.2FDeploying ) [13:13:52] 06Operations, 10ops-eqiad, 10DBA: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2483315 (10jcrespo) [13:13:57] RECOVERY - Host dbproxy1002 is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [13:14:10] 06Operations, 10ops-eqiad, 10DBA: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2483328 (10jcrespo) ``` MariaDB MISC m2 localhost (none) > SHOW DATABASES; +--------------------+ | Database | +--------------------+ | bugzilla_testing | | frimpressions | | heartbeat | |... [13:14:19] Krenair: kk, thanks [13:14:24] ah it is back [13:17:01] (03PS1) 10Yuvipanda: shinken: Use new labs graphite URL [puppet] - 10https://gerrit.wikimedia.org/r/300265 (https://phabricator.wikimedia.org/T140976) [13:18:07] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: Connection refused by host [13:18:17] PROBLEM - configured eth on dbproxy1002 is CRITICAL: Connection refused by host [13:18:21] !log mathoid deploying 36be4ea [13:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:18:28] PROBLEM - dhclient process on dbproxy1002 is CRITICAL: Connection refused by host [13:18:46] PROBLEM - DPKG on dbproxy1002 is CRITICAL: Connection refused by host [13:18:48] PROBLEM - Disk space on dbproxy1002 is CRITICAL: Connection refused by host [13:18:48] PROBLEM - haproxy process on dbproxy1002 is CRITICAL: Connection refused by host [13:18:58] PROBLEM - MD RAID on dbproxy1002 is CRITICAL: Connection refused by host [13:19:07] PROBLEM - puppet last run on dbproxy1002 is CRITICAL: Connection refused by host [13:19:08] PROBLEM - salt-minion processes on dbproxy1002 is CRITICAL: Connection refused by host [13:19:17] PROBLEM - haproxy alive on dbproxy1002 is CRITICAL: Connection refused by host [13:19:17] PROBLEM - MPT RAID on dbproxy1002 is CRITICAL: Connection refused by host 
[13:19:55] YuviPanda: nice, thanks for working on grafana-labs ! did you play with prometheus-tools already? [13:20:15] godog yup, just added it as a data source! [13:20:39] godog but I can't get graphite added, am completing the migration of labs graphite to graphite-labs.wikimedia.org (behind misc varnish now) before trying it again [13:21:44] godog have you played with prometheus expression language? I've a few questions [13:21:51] YuviPanda: a bit yeah [13:23:46] PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 1 yellow alarms [13:24:46] PROBLEM - Host dbproxy1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:07] godog ok, I'll dig around some more and poke you with questions :) [13:26:07] RECOVERY - Host dbproxy1002 is UP: PING OK - Packet loss = 0%, RTA = 2.39 ms [13:26:34] YuviPanda: hehe ok, let me know if you can add graphite too [13:26:50] godog will do! [13:28:47] !log cr2-eqiad: toggling mastership between routing-engines (re0->re1) [13:30:07] !log cr2-eqiad: powering off re0 (backup) [13:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:30:20] did it lose one? 
[13:30:31] SAL says no, weird [13:31:04] (03Abandoned) 10Ema: cache_upload: do not set Access-Control-Allow-Origin twice [puppet] - 10https://gerrit.wikimedia.org/r/300249 (owner: 10Ema) [13:31:16] !log cr2-eqiad: setting fabric plane 0/1/2/3 to offline [13:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:31:42] !log cr2-eqiad: setting scb 0 to offline and replacing it [13:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:37:45] (03CR) 10Ottomata: Move the Analytics Refinery role to scap3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [13:38:31] !log cr2-eqiad: toggling mastership between routing-engines (re1->re0) [13:39:21] (03CR) 10Ottomata: Move the Analytics Refinery role to scap3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [13:40:41] !log cr2-eqiad: fabric upgrade bandwidth for FPC 4/5 [13:41:09] PROBLEM - Disk space on es2001 is CRITICAL: Timeout while attempting connection [13:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:43:00] RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [13:43:49] PROBLEM - Host cr1-eqord is DOWN: PING CRITICAL - Packet loss = 100% [13:43:50] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.110, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:43:50] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.112, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., 
error(111, Connection refused))) [13:43:50] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.80, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:43:51] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned the unexpec [13:43:59] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200) [13:43:59] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned the unexpec [13:44:19] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.149, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:44:41] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.178, port=7231): Max 
retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:44:41] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.134, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:45:00] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [13:45:01] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 504 (exp [13:45:10] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - restbase_7231 - Could not depool server restbase1008.eqiad.wmnet because of too many down! 
[13:45:16] PROBLEM - LVS HTTP IPv4 on restbase.svc.eqiad.wmnet is CRITICAL: Connection refused [13:45:16] PROBLEM - Restbase root url on restbase1010 is CRITICAL: Connection refused [13:45:16] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=restbase.svc.eqiad.wmnet, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by class socket.error: [Errno 111] Connection refused) [13:45:17] PROBLEM - puppet last run on mw2103 is CRITICAL: CRITICAL: Puppet has 1 failures [13:45:19] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [13:45:19] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [13:45:28] <_joe_> wat? [13:45:30] PROBLEM - puppet last run on mw2089 is CRITICAL: CRITICAL: Puppet has 2 failures [13:45:31] PROBLEM - Restbase root url on restbase1015 is CRITICAL: Connection refused [13:45:31] PROBLEM - restbase endpoints health on cerium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.147, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:45:40] PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Puppet has 37 failures [13:45:43] what's happening? [13:45:44] <_joe_> what the hell happened? [13:45:50] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - restbase_7231 - Could not depool server restbase1008.eqiad.wmnet because of too many down! 
[13:45:58] <_joe_> going to take a look on one of the rb machines [13:46:00] RECOVERY - Host cr1-eqord is UP: PING OK - Packet loss = 0%, RTA = 43.53 ms [13:46:00] PROBLEM - puppet last run on druid1003 is CRITICAL: CRITICAL: puppet fail [13:46:00] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Puppet has 36 failures [13:46:01] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [13:46:01] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [13:46:09] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Puppet has 5 failures [13:46:09] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [13:46:11] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: puppet fail [13:46:11] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: puppet fail [13:46:14] cr1-eqord down. er? [13:46:19] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [1000.0] [13:46:19] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: puppet fail [13:46:20] PROBLEM - Restbase root url on restbase1009 is CRITICAL: Connection refused [13:46:21] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - restbase_7231 - Could not depool server restbase1008.eqiad.wmnet because of too many down! 
[13:46:26] no it's not [13:46:29] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: puppet fail [13:46:29] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Puppet has 2 failures [13:46:30] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.133, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:46:30] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.79, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:46:31] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [13:46:31] PROBLEM - restbase endpoints health on xenon is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.200, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:46:31] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200) [13:46:32] <_joe_> this is restbase [13:46:39] PROBLEM - Restbase root url on restbase1013 is CRITICAL: Connection refused [13:46:40] PROBLEM - Restbase root url on cerium is CRITICAL: Connection refused [13:46:50] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 6 failures [13:46:57] (03CR) 10Jgreen: [C: 031] remove all aluminum/aluminium 
remnants [dns] - 10https://gerrit.wikimedia.org/r/300213 (https://phabricator.wikimedia.org/T140676) (owner: 10Dzahn) [13:46:58] (04:43:49 PM) icinga-wm: PROBLEM - Host cr1-eqord is DOWN: PING CRITICAL - Packet loss = 100% icinga thinks/thought so [13:46:59] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 2 failures [13:46:59] PROBLEM - puppet last run on elastic1037 is CRITICAL: CRITICAL: Puppet has 2 failures [13:47:00] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 1 failures [13:47:10] PROBLEM - Restbase root url on restbase1012 is CRITICAL: Connection refused [13:47:10] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Puppet has 5 failures [13:47:10] PROBLEM - puppet last run on mw1159 is CRITICAL: CRITICAL: Puppet has 5 failures [13:47:11] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: Puppet has 3 failures [13:47:11] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Puppet has 5 failures [13:47:11] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Puppet has 39 failures [13:47:11] PROBLEM - puppet last run on mw1188 is CRITICAL: CRITICAL: Puppet has 5 failures [13:47:11] PROBLEM - Restbase root url on restbase1014 is CRITICAL: Connection refused [13:47:12] PROBLEM - Restbase root url on restbase1008 is CRITICAL: Connection refused [13:47:19] PROBLEM - puppet last run on wtp1024 is CRITICAL: CRITICAL: Puppet has 1 failures [13:47:20] PROBLEM - puppet last run on mc2002 is CRITICAL: CRITICAL: Puppet has 3 failures [13:47:20] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Puppet has 29 failures [13:47:21] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Puppet has 1 failures [13:47:21] PROBLEM - puppet last run on mw1115 is CRITICAL: CRITICAL: Puppet has 2 failures [13:47:29] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: Puppet has 5 failures [13:47:29] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Puppet has 3
failures [13:47:29] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: puppet fail [13:47:40] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Puppet has 22 failures [13:47:41] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Puppet has 4 failures [13:47:49] PROBLEM - Redis status tcp_6381 on rdb2004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.16.123 on port 6381 [13:47:50] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 2 failures [13:47:55] <_joe_> shit [13:48:00] PROBLEM - puppet last run on mw1196 is CRITICAL: CRITICAL: Puppet has 7 failures [13:48:00] PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: Puppet has 1 failures [13:48:08] <_joe_> ok for restbase something really strange is happening [13:48:10] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 3 failures [13:48:10] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [13:48:20] what is going on here??? [13:48:22] godog https://grafana-labs-admin.wikimedia.org/dashboard/db/kubernetes-pods :D enter name of any tool in template for stats!
(example: geohack / xtools-articleinfo) [13:48:30] RECOVERY - Restbase root url on restbase1009 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.073 second response time [13:48:30] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 4 failures [13:48:37] at least two independent problems, probably [13:48:39] PROBLEM - Host mr1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:48:39] PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: Puppet has 3 failures [13:48:40] PROBLEM - puppet last run on mc2009 is CRITICAL: CRITICAL: Puppet has 3 failures [13:48:40] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [13:48:41] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [13:49:00] PROBLEM - Restbase root url on praseodymium is CRITICAL: Connection refused [13:49:02] godog however, it doesn't show up in https://grafana-labs.wikimedia.org/dashboard/db/kubernetes-pods [13:49:11] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Puppet has 6 failures [13:49:15] <_joe_> I am looking at restbase [13:49:20] RECOVERY - Restbase root url on restbase1014 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.018 second response time [13:49:21] RECOVERY - Restbase root url on restbase1010 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.008 second response time [13:49:29] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [13:49:29] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [13:49:30] damn, it seems like https://phabricator.wikimedia.org/T136957 mass-happened on RB [13:49:36] damn [13:49:40] RECOVERY - Redis status tcp_6381 on rdb2004 is OK: OK: REDIS on 10.192.16.123:6381 has 1 databases (db0) with 9754281 keys - replication_delay is 0 [13:49:43] * mobrovac restarting RB [13:49:50] PROBLEM - Restbase root url on xenon is CRITICAL: Connection refused [13:49:56] <_joe_> mobrovac:
I am doing it [13:50:06] <_joe_> coordinate [13:50:12] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [13:50:19] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [13:50:19] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [13:50:31] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [13:50:35] k _joe_ [13:50:49] RECOVERY - Restbase root url on restbase1013 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.070 second response time [13:50:50] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [13:51:10] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [13:51:10] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [13:51:20] RECOVERY - Restbase root url on restbase1012 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.013 second response time [13:51:21] RECOVERY - Restbase root url on restbase1008 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.016 second response time [13:51:29] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [13:51:34] <_joe_> done [13:51:37] RECOVERY - LVS HTTP IPv4 on restbase.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.010 second response time [13:51:49] <_joe_> 7 minutes of outage [13:51:51] RECOVERY - Restbase root url on restbase1015 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.017 second response time [13:51:57] <_joe_> just because I didn't trust my guts :/ [13:52:17] <_joe_> mobrovac: about that ticket [13:52:19] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [13:52:25] 06Operations, 10RESTBase, 06Services, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2483364 (10mobrovac) This mass-happened today: ``` (15:43:50) icinga-wm: PROBLEM - 
restbase endpoints health on restbase1009 is CRITICAL: Generic error: Generic connectio... [13:52:27] well there's two problems in the spam above: whatever happened with RB shutdown, and a network blip causing a spam of puppetfail [13:52:31] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [13:52:31] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [13:52:39] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [13:52:39] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [13:52:48] one might have triggered the other [13:53:00] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [13:53:21] PROBLEM - puppet last run on mw2110 is CRITICAL: CRITICAL: Puppet has 1 failures [13:53:25] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1469105590309&to=1469109190309&var-site=eqiad&var-cache_type=%24__all&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5 [13:53:28] (03PS1) 10Rush: check_legal: mobile privacy reference is now explicitly https [puppet] - 10https://gerrit.wikimedia.org/r/300272 [13:53:34] ^ shows the dip in eqiad traffic from the public [13:53:34] <_joe_> mobrovac: I see now restbase doesn't have "Restart: always" [13:53:54] <_joe_> bblack: we clearly had a network issue [13:54:06] <_joe_> and that might have crashed restbase [13:54:17] <_joe_> but the real issue that caused such a long outage is [13:54:45] yes, it's possible that the mysterious RB outages are just hypersensitivity to network blips [13:54:51] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:54:52] <_joe_> the systemd unit having an issue [13:55:00] PROBLEM - puppet last run on mw2062 is CRITICAL: CRITICAL: Puppet has 1 failures [13:55:03] and yeah, systemd should have
some sane service-restart config [13:55:10] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp2009_v4, cp2021_v4 [13:55:30] PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp2018_v4, cp2025_v4 [13:55:30] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2018_v4 [13:55:44] (03CR) 10Rush: [C: 032] check_legal: mobile privacy reference is now explicitly https [puppet] - 10https://gerrit.wikimedia.org/r/300272 (owner: 10Rush) [13:55:59] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp2006_v4, cp2012_v4 [13:56:50] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:56:58] <_joe_> mobrovac: it seems you chose to shoot yourself in the foot with @init_restart = false [13:57:16] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Mpany - https://phabricator.wikimedia.org/T140399#2483375 (10Jgreen) > You will have to configure your ssh client to connect via the bastion hosts to any servers in our interna... 
[13:57:40] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 24 ESP OK [13:57:51] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:59:19] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [13:59:22] (03CR) 10Jgreen: [C: 031] admin: add shell account for Jasmeet Samra [puppet] - 10https://gerrit.wikimedia.org/r/300196 (https://phabricator.wikimedia.org/T140445) (owner: 10Dzahn) [13:59:41] RECOVERY - Restbase root url on praseodymium is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.041 second response time [13:59:47] 06Operations, 10RESTBase, 06Services, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2483376 (10Joe) Any production service running on systemd and not having ``` Restart=always ``` is a large liability as shown by the outage we just experienced. This be... [13:59:49] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 24 ESP OK [14:00:11] (03PS2) 10Yuvipanda: shinken: Use new labs graphite URL [puppet] - 10https://gerrit.wikimedia.org/r/300265 (https://phabricator.wikimedia.org/T140976) [14:00:20] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:00:22] (03CR) 10Yuvipanda: [C: 032 V: 032] shinken: Use new labs graphite URL [puppet] - 10https://gerrit.wikimedia.org/r/300265 (https://phabricator.wikimedia.org/T140976) (owner: 10Yuvipanda) [14:01:30] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has 1 failures [14:02:00] (03PS1) 10Giuseppe Lavagetto: restbase: have systemd restart failed nodes [puppet] - 10https://gerrit.wikimedia.org/r/300275 (https://phabricator.wikimedia.org/T136957) [14:02:08] (03PS1) 10Mobrovac: RESTBase: disable firejail [puppet] - 10https://gerrit.wikimedia.org/r/300276 (https://phabricator.wikimedia.org/T136957) [14:02:18] (03PS2) 10Chad: Remove RevisionSlider from beta's extension-list. 
Already in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300091 [14:02:32] <_joe_> godog, mobrovac since you're the two making the call on not having Restart=always in RB [14:02:40] !log cr2-eqiad: disabling all asw-*-eqiad interfaces [14:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:48] <_joe_> please review https://gerrit.wikimedia.org/r/300275 [14:03:02] RECOVERY - Restbase root url on cerium is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.048 second response time [14:04:09] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We still have no proof it's firejail killing restbase and we have no idea of a root cause." [puppet] - 10https://gerrit.wikimedia.org/r/300276 (https://phabricator.wikimedia.org/T136957) (owner: 10Mobrovac) [14:04:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:04:21] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK [14:04:22] RECOVERY - Restbase root url on xenon is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.016 second response time [14:04:23] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [14:04:31] !log cr2-eqiad: disabling xe-5/2/3 (link to cr2-codfw) [14:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:48] !log cr2-eqiad: disabling xe-4/2/0 (link to cr1-eqord) [14:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:05:16] (03CR) 10Chad: [C: 032] Remove RevisionSlider from beta's extension-list. Already in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300091 (owner: 10Chad) [14:05:23] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2483391 (10mobrovac) >>! 
In T136957#2483376, @Joe wrote: > Any production service running on systemd and not having > > ``` > Restart=always > ``` >... [14:05:43] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [14:05:47] (03Merged) 10jenkins-bot: Remove RevisionSlider from beta's extension-list. Already in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300091 (owner: 10Chad) [14:06:51] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 67, down: 4, dormant: 0, excluded: 0, unused: 0BRae3: down - Core: asw-c-eqiad:ae2BRae4: down - Core: asw-d-eqiad:ae2BRae1: down - Core: asw-a-eqiad:ae2BRae2: down - Core: asw-b-eqiad:ae2BR [14:07:02] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2483406 (10Joe) Crashes happen. We need to be able to survive a mass crash (we can on the appservers precisely because upstart restarts the services... [14:07:07] !log cr2-eqiad: halting both routing engines(!) [14:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:36] <_joe_> mobrovac: seriously, explain to me why it's a good idea not to restart restbase when it stops without a human telling it to stop [14:07:51] RECOVERY - Host mr1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 4.27 ms [14:07:57] <_joe_> because I can't find a good reason not to [14:09:27] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2483407 (10mobrovac) From https://gerrit.wikimedia.org/r/#/c/300276/ by @Joe: > We still have no proof it's firejail killing restbase and we have no... [14:09:51] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [14:09:53] _joe_: why do we want to tolerate a service being killed?
[14:10:04] <_joe_> mobrovac: because we want to serve users? [14:10:11] <_joe_> it's not like you don't get to know it [14:10:13] <_joe_> it's logged [14:10:21] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [14:10:22] RECOVERY - puppet last run on ms-be1012 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:10:32] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [14:10:42] RECOVERY - puppet last run on wtp1024 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:10:45] <_joe_> I mean give me a reason why systemd should not restart rb when it fails [14:10:51] PROBLEM - Host cr2-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.197) [14:10:52] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:10:53] <_joe_> which is not "we would not notice" [14:11:01] RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [14:11:02] <_joe_> because if you intend to, you will [14:11:11] PROBLEM - Host cr2-eqiad IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ffff::2 [14:11:29] (03PS5) 10Chad: Gerrit: Greatly simplify directory management on host [puppet] - 10https://gerrit.wikimedia.org/r/300048 [14:11:31] (03PS1) 10Chad: Gerrit: Store the ssh_host_key in private puppet secrets [puppet] - 10https://gerrit.wikimedia.org/r/300279 [14:11:32] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:11:32] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:11:32] RECOVERY - puppet 
last run on elastic1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:11:42] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:11:44] (03PS2) 10Chad: Remove OATHAuth from wikitech's extension-list, it's in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300105 [14:11:51] (03CR) 10Chad: [C: 032] Remove OATHAuth from wikitech's extension-list, it's in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300105 (owner: 10Chad) [14:11:52] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:11] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:12:11] _joe_: as you may have gotten from the ticket, i don't think that's restbase failing, but rather firejail killing it [14:12:12] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:12] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:12:12] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:12:21] RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [14:12:22] RECOVERY - puppet last run on mw2089 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:23] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:24] _joe_: which is a different problem [14:12:31] RECOVERY - puppet last run on druid1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:31] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0
failures [14:12:32] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:32] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:12:33] _joe_: i do agree that the outcome for users is the same [14:12:46] <_joe_> mobrovac: I think you got it wrong, but even if it was, still explain to me why Restart=always is a bad idea [14:12:56] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:57] RECOVERY - puppet last run on mw1159 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:57] RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:58] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [14:13:06] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:13:09] <_joe_> I think your problem is that firejail translates rb crash exit codes to 0 [14:13:16] (03Merged) 10jenkins-bot: Remove OATHAuth from wikitech's extension-list, it's in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300105 (owner: 10Chad) [14:13:25] <_joe_> which makes systemd without restart=always NOT restart the service [14:13:27] (03CR) 10Paladox: [C: 031] Gerrit: Greatly simplify directory management on host [puppet] - 10https://gerrit.wikimedia.org/r/300048 (owner: 10Chad) [14:13:38] RECOVERY - puppet last run on mw1188 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:13:38] <_joe_> but I just gave a quick look [14:13:46] RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:13:47] RECOVERY - puppet last run on mc2002 is OK: OK: Puppet is currently enabled, last
run 1 minute ago with 0 failures [14:14:06] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:14:08] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 216, down: 5, dormant: 0, excluded: 0, unused: 0BRxe-5/3/0: down - Core: cr2-eqiad:xe-5/3/0 {#2651} [10Gbps DF]BRxe-4/3/0: down - Core: cr2-eqiad:xe-4/3/0 {#3456} [10Gbps DF]BRae0: down - Core: cr2-eqiad:ae0BRae0.0: down - BRxe-5/2/0: down - Core: cr2-eqiad:xe-5/2/0 {#1983} [10Gbps DF]BR [14:14:27] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:14:27] !log demon@tin Synchronized wmf-config/: extension list cleanups (duration: 00m 34s) [14:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:14:37] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:14:46] RECOVERY - puppet last run on mc2009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:14:57] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:15:07] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [14:15:20] <_joe_> mobrovac: I'm not saying we should not disable firejail, just I'd want a bit more evidence that's the issue here [14:15:27] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:15:39] (03CR) 10Paladox: [C: 031] Gerrit: Store the ssh_host_key in private puppet secrets [puppet] - 10https://gerrit.wikimedia.org/r/300279 (owner: 10Chad) [14:15:47] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:15:47] RECOVERY - puppet last run on mw2103 is 
OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:16:32] I think default is Restart=no which doesn't restart in any case [14:16:54] <_joe_> godog: nope [14:17:57] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 24 ESP OK [14:18:09] <_joe_> still, I need a good reason NOT to enable that [14:18:46] RECOVERY - Host cr2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.69 ms [14:19:24] _joe_: double check, restart=no is the systemd default [14:19:30] <_joe_> godog: yep [14:19:37] RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:53] <_joe_> we seriously want all our prod user-facing apps to restart=always [14:19:58] RECOVERY - puppet last run on mw2062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:20:17] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0 [14:20:53] anyways let's figure out what happened first and then what to do with restart behaviour [14:20:57] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: puppet fail [14:21:03] _joe_: godog: ok, historical context, when we set restart=no for RB, the problem was systemd restarting it continuously after failed starts (where RB wouldn't even be able to start up in the first place) [14:21:14] that was relevant at the time [14:21:21] because we had schema changes going on [14:21:23] <_joe_> mobrovac: there is an option to limit that [14:21:44] <_joe_> and I think we use it? [14:21:52] to limit what? [14:22:04] <_joe_> the rate at which a service will be restarted [14:22:38] also it should have been restart=on-failure, no? 
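[editor's note] The exit-code masking _joe_ suspects above can be sketched with a hypothetical wrapper standing in for firejail. This is only an illustration of the failure mode under discussion, not firejail's documented behaviour: if the supervising process reports status 0 even when the child crashes, systemd's `Restart=on-failure` never triggers, which is why `Restart=always` is being proposed.

```shell
# Hypothetical stand-in for the suspected firejail behaviour: run the
# child command, then report success regardless of how it exited.
mask() {
    "$@"        # the child may crash with a non-zero status...
    return 0    # ...but the wrapper swallows it and reports 0
}

mask false                    # 'false' exits 1, i.e. a simulated crash
echo "wrapper reported: $?"   # prints 0, so on-failure logic never fires
```

Under this assumption systemd sees a clean exit, and with the default `Restart=no` (or even `Restart=on-failure`) it leaves the service down.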
[14:22:42] RestartSec [14:22:53] <_joe_> godog: not really [14:22:56] we have it set at 2 [14:23:22] <_joe_> godog: actually, if we wanted to get fancy, we could build into service-runner a systemd notifier [14:23:36] RECOVERY - Host cr2-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 3.26 ms [14:24:53] so systemd will restart a service at max once in 2 seconds [14:25:16] 07Blocked-on-Operations, 06Operations, 10Continuous-Integration-Infrastructure, 10Zuul: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894#2483419 (10Paladox) >>! In T140894#2482428, @demon wrote: > Let's do this tomorrow morning maybe? :) [14:26:33] !log cr2-eqiad: reenabling all asw-*-eqiad interfaces [14:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:57] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 192, down: 0, dormant: 0, excluded: 0, unused: 0 [14:29:59] !log cr2-eqiad: reenabling xe-4/2/0 (link to cr1-eqord) and xe-5/2/3 (link to cr2-codfw) [14:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:31:17] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [14:32:13] godog: unrelated, but i see in rb1008 syslog java.net.UnknownHostException: graphite1003.eqiad.wmnet [14:32:20] from the metrics collector [14:32:24] from an hour ago [14:33:58] 06Operations, 06Labs, 13Patch-For-Review: Move labs graphite to graphite-labs.wikimedia.org - https://phabricator.wikimedia.org/T140899#2483436 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Done and left redirects in place! [14:34:00] <_joe_> yeah there was some dns failure around that time [14:34:06] <_joe_> mobrovac: what time exactly? 
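[editor's note] The restart policy hashed out above can be sketched as a unit-file fragment. This is an illustration only: the directive names come from systemd's documentation, `RestartSec=2` is the value mentioned in the discussion, and the `ExecStart` path and start-limit values are made up.

```ini
# Illustrative fragment of a restbase.service unit, not the deployed config.
[Service]
ExecStart=/usr/bin/nodejs /srv/restbase/server.js
Restart=always          # restart on any stop not initiated by an operator
RestartSec=2            # wait 2 seconds between restart attempts
StartLimitInterval=60   # hypothetical: together with StartLimitBurst, breaks
StartLimitBurst=5       #   an endless loop when the service cannot start at all
```

`systemctl show -p Restart restbase` shows the effective value; as noted above, systemd's default is `Restart=no`, under which a service that exits, for whatever reason, simply stays down.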
[14:34:51] _joe_: same time as the RB failure - 13:41:12 [14:35:03] <_joe_> yeah same time of the dns failures on the puppetmaster [14:35:18] <_joe_> mobrovac: I think this is strictly related to this rb crash tbh [14:35:33] this == dns failure? [14:35:36] <_joe_> yes [14:35:49] <_joe_> to the rb crash [14:35:53] !log cr2-eqiad: enabling Fundraising interface & BGP [14:35:54] Jul 21 13:40:54 re0.cr2-eqiad alarmd[2899]: Alarm cleared: CB color=RED, class=CHASSIS, reason=CB fabric links require upgrade/training [14:35:54] Jul 21 13:40:54 re0.cr2-eqiad craftd[1672]: Major alarm cleared, CB fabric links require upgrade/training [14:35:56] hm [14:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:36:07] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 108, down: 0, dormant: 0, excluded: 1, unused: 0 [14:36:17] <_joe_> I am going to take a pause [14:36:25] <_joe_> it's been a stressful 2 hours [14:37:05] !log cr2-eqiad: reenabling Transit interfaces & BGP [14:37:08] _joe_: you don't say.. [14:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:11] I think i'm more calm now than doing my regular job ;p [14:37:16] godog also did you include the icinga prometheus check in our infrastructure? 
[14:37:30] <_joe_> paravoid: ehehh [14:38:25] YuviPanda: I didn't yet, no [14:38:56] godog ok, let me know when you do :) also the graphs in admin grafana aren't showing up in readonly grafana, let me know if you have time to help investigate :) [14:39:29] !log cr2-eqiad: reenabling IX interface & BGP [14:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:40:34] !log cr2-eqiad: restoring PyBal BGP sessions [14:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:42:59] !log cr2-eqiad: restoring VRRP priorities [14:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:43:11] (03PS2) 10MarcoAurelio: Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) [14:43:58] !log cr2-eqiad is now upgraded, passing transit and cross-DC traffic and is the VRRP master in eqiad [14:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:58] (03PS1) 10Mobrovac: Revert "Change-prop: Ignore bot edits on ORES precache updates." [puppet] - 10https://gerrit.wikimedia.org/r/300282 [14:45:26] RECOVERY - Host lutetium is UP: PING OK - Packet loss = 0%, RTA = 2.16 ms [14:45:37] (03PS3) 10MarcoAurelio: Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) [14:48:26] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "Change-prop: Ignore bot edits on ORES precache updates." [puppet] - 10https://gerrit.wikimedia.org/r/300282 (owner: 10Mobrovac) [14:48:42] mobrovac: ^ [14:48:48] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:48:49] grazie godog! [14:48:52] prego :) [14:49:37] <_joe_> mobrovac: uh, what happened? 
[14:50:10] _joe_: a bug in the code of the extension sending the flag that is checked by changeprop :( [14:50:19] <_joe_> lol [14:52:10] 06Operations, 10ops-eqiad, 06DC-Ops: Remove all out of warranty unused cp10xx's from A2 - https://phabricator.wikimedia.org/T120856#2483487 (10Cmjohnson) Most all of the servers are removed...there are a few still in production dbproxy1001 dbproxy1002 dbproxy1003 scandium uranium radium [14:54:15] (03PS6) 10Giuseppe Lavagetto: puppetmaster: Apache 2.4/jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300242 [14:55:03] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster: Apache 2.4/jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300242 (owner: 10Giuseppe Lavagetto) [14:55:26] 06Operations, 10Ops-Access-Requests: Requesting access to text caches for andyrussg - https://phabricator.wikimedia.org/T140958#2483494 (10AndyRussG) >>! In T140958#2483278, @BBlack wrote: > Yes, outside of global roots, access to any of the caches is pretty tightly restricted. It's not just based on needs, b... [14:56:39] (03PS1) 10Cmjohnson: Removing mgmt dns from cp1043/1044 decom'd t133614 [dns] - 10https://gerrit.wikimedia.org/r/300284 [14:58:32] !log stopping dbstore1002 for scheduled maintenace T119488 [14:58:33] T119488: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488 [14:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:04] anomie, ostriches, thcipriani, hashar, twentyafterfour, and aude: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160721T1500). Please do the needful. [15:00:04] yurik: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[15:00:12] (03PS1) 10Gehel: New partition scheme for relforge (elasticsearch) servers [puppet] - 10https://gerrit.wikimedia.org/r/300286 (https://phabricator.wikimedia.org/T137256) [15:00:53] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 10Wikimedia-Logstash: logstash - cron failing to optimize indices - https://phabricator.wikimedia.org/T140973#2483500 (10bd808) The crons being on all role::logstash nodes was intentional because as you say multiple invocations of th... [15:03:25] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 10Wikimedia-Logstash: logstash - cron failing to optimize indices - https://phabricator.wikimedia.org/T140973#2483501 (10Gehel) Ok, redundancy makes sense. I can delete all puppet managed crons and re-run puppet, which should cleanup... [15:04:03] (03PS1) 10Giuseppe Lavagetto: puppetmaster: fix apache vhost syntax [puppet] - 10https://gerrit.wikimedia.org/r/300287 [15:04:36] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster: fix apache vhost syntax [puppet] - 10https://gerrit.wikimedia.org/r/300287 (owner: 10Giuseppe Lavagetto) [15:07:57] (03CR) 10Gehel: "Since this change does not seem to be needed, should we drop it?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/297459 (owner: 10DCausse) [15:08:12] (03PS1) 10Giuseppe Lavagetto: puppetmaster: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/300288 [15:09:28] (03PS2) 10Giuseppe Lavagetto: puppetmaster: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/300288 [15:10:02] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/300288 (owner: 10Giuseppe Lavagetto) [15:12:41] 06Operations, 10fundraising-tech-ops, 10netops: Cleanup layer2 firewall config from pfw-eqiad - https://phabricator.wikimedia.org/T111463#2483519 (10Jgreen) [15:13:05] 06Operations, 10Gerrit, 10hardware-requests: Need spare server to upgrade/migrate gerrit - https://phabricator.wikimedia.org/T123132#2483525 (10demon) [15:13:07] RECOVERY - puppetmaster backend https on rhodium is OK: HTTP OK: Status line output matched 400 - 330 bytes in 0.067 second response time [15:13:07] 06Operations, 10Gerrit, 13Patch-For-Review: setup/deploy server lead as jessie gerrit server - https://phabricator.wikimedia.org/T126794#2483521 (10demon) 05Open>03Resolved Lead is deployed and running gerrit on Jessie. It's just not the master yet. That's T70271. [15:13:09] <_joe_> puppet failures are expected now [15:13:24] <_joe_> I am running puppet on the puppet masters, and that will reload apache [15:14:33] 06Operations, 10Gerrit, 13Patch-For-Review: Update gerrit sshkey in role::ci::slave::labs when upgrade to Jessie happens - https://phabricator.wikimedia.org/T131903#2483526 (10demon) 05Open>03Resolved a:03demon That public key won't be changing, neither will the ssh host key. I'm tentatively closing this. 
[15:15:26] 06Operations, 10fundraising-tech-ops, 10netops: Cleanup layer2 firewall config from pfw-eqiad - https://phabricator.wikimedia.org/T111463#2483531 (10Jgreen) p:05Low>03High bumping to high because this blocks adding pfw ports, which in turn blocks hardware refreshes [15:18:12] (03CR) 10Eevans: [C: 04-1] "This isn't unreasonable, but I'm -1 for committing this cluster-wide at the moment. It should be tested in a more isolated manner first, " [puppet] - 10https://gerrit.wikimedia.org/r/300100 (https://phabricator.wikimedia.org/T140825) (owner: 10GWicke) [15:20:04] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 10netops: put pfw1- ge-2/0/11 in the 'fundraising' vlan for new host frqueue1001 - https://phabricator.wikimedia.org/T140991#2483556 (10Jgreen) [15:21:04] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10MediaWiki-Search: Italian Wikivoyage search feature doesn't work with non existing pages - https://phabricator.wikimedia.org/T140982#2483574 (10Danny_B) p:05Triage>03Unbreak! Confirming. [15:21:46] _joe_, mobrovac when do you guys want to try a dummy parsoid deploy today to verify trebuchet deploys are fine? we can try any time or do it during the services window in ~90 odd mins. [15:22:06] <_joe_> subbu: we're going into a meeting in 8 minutes [15:22:18] <_joe_> so I'd say let's test it either right now [15:22:21] <_joe_> or in 40 [15:22:23] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 10netops: put pfw1- ge-2/0/11 in the 'fundraising' vlan for new host frqueue1001 - https://phabricator.wikimedia.org/T140991#2483582 (10Jgreen) [15:22:27] ok .. in 40 then. [15:32:10] (03PS2) 10Andrew Bogott: Disable instance rebuild in Horizon. 
[puppet] - 10https://gerrit.wikimedia.org/r/300077 (https://phabricator.wikimedia.org/T140259) [15:32:12] (03PS1) 10Andrew Bogott: Use special monitor-account creds for the rabbitmq collector [puppet] - 10https://gerrit.wikimedia.org/r/300293 [15:32:14] (03CR) 10RobH: [C: 031] "a few notes:" [puppet] - 10https://gerrit.wikimedia.org/r/300286 (https://phabricator.wikimedia.org/T137256) (owner: 10Gehel) [15:33:04] 06Operations, 10ops-codfw, 10netops: audit network ports in a4-codfw - https://phabricator.wikimedia.org/T140935#2483624 (10Papaul) ge-4/0/0 up up mw2239 ge-4/0/1 up up mw2240 ge-4/0/2 up up mw2241 ge-4/0/3 up up mw2242 ge-4/0/4 up up mw2243 ge-4/0/5 up up mw2244 ge-4/0/6 up up mw2245 ge-4/0/7 up up mw2246 g... [15:34:38] joal, did you just see my email [15:34:57] joal, feel free to check if there is something broken on your side [15:38:30] there are api issues with wikidata [15:38:36] (03CR) 10Andrew Bogott: [C: 032] Disable instance rebuild in Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/300077 (https://phabricator.wikimedia.org/T140259) (owner: 10Andrew Bogott) [15:39:09] "Wikibase\Repo\Store\WikiPageEntityStore::updateWatchlist: Automatic transaction with writes in progress (from DatabaseBase::query (LinkCache::addLinkObj)), performing implicit commit!" [15:39:20] It could be no issue, maybe just log noise? [15:39:27] issues* [15:39:41] (03CR) 10Andrew Bogott: [C: 032] Use special monitor-account creds for the rabbitmq collector [puppet] - 10https://gerrit.wikimedia.org/r/300293 (owner: 10Andrew Bogott) [15:41:46] (03PS1) 10ArielGlenn: clean up verbose mode print of commands to run [dumps] - 10https://gerrit.wikimedia.org/r/300294 [15:42:33] this seems to be happening since 10:10, but I do not see any deployments there [15:44:37] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [15:45:56] oh, perfectly reported already!
https://phabricator.wikimedia.org/T140955 [15:46:08] 06Operations, 10Cassandra, 06Services, 10hardware-requests: 6x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#2483658 (10mark) @RobH could you prepare quotes for this? Thanks! [15:46:16] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:46:46] jynus: :) :) [15:46:47] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:46:54] thanks, greg-g [15:46:56] (03PS3) 10MarcoAurelio: Closing wikimania2015wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298772 (https://phabricator.wikimedia.org/T139032) [15:47:13] 06Operations, 10Traffic, 06WMF-Communications, 07HTTPS, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2483662 (10Florian) [15:48:27] especially putting the error/function on the title helps with the visibility (I also do that to avoid duplicate reports) [15:48:47] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:05] jynus: ditto [15:50:08] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [15:50:16] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:51:10] 06Operations, 10Cassandra, 06Services, 10hardware-requests: 6x additional Cassandra/RESTBase nodes -
https://phabricator.wikimedia.org/T139961#2483687 (10RobH) a:03RobH [15:55:05] bblack: hey! lmk if sometime you'd like to have another go at checking for CN cookies on a cache server. Sorry for unnecessarily opening the access task... I'd especially like to see the full details of the "*-campaign (where * = 'enwiki', 'eswiki', etc.)" bit from the previous attempt... thx!!! [15:55:43] I believe full results can't be posted anywhere public due to privacy issues, so if you're OK with it, another channel could be found. Thx again :) [15:56:16] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2483717 (10CCogdill_WMF) After pushing IBM for a couple weeks, they finally sent us this response today: “After reviewing... [15:56:18] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:57:07] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [16:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160721T1600). Please do the needful. [16:00:05] hashar, urandom, and thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:12] o/ [16:00:17] AndyRussG: from the list we have so far, it seems like there's no general pattern that covers all CN cookies, right? That might be another important thing going forward: giving them all a common prefix or suffix, like "CN_" [16:01:04] jynus: bah, missed your comment as I was typing mine, sorry for being redundant [16:01:13] np [16:01:20] it happens very frequently [16:01:21] bblack: indeed.
That's what we do have going forward :) The unpredictable names are basically from in-banner JS included in community banners over time [16:02:00] ok [16:02:13] jynus: https://phabricator.wikimedia.org/T765 :) [16:02:28] nice [16:02:30] 06Operations, 03Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2483755 (10debt) 05Open>03Resolved a:03debt [16:02:52] present: o/ [16:03:13] blocked on exposing websocket ports [16:04:17] (03PS2) 10Gehel: Changed partition scheme for relforge (elasticsearch) servers [puppet] - 10https://gerrit.wikimedia.org/r/300286 (https://phabricator.wikimedia.org/T137256) [16:06:15] ok urandom first [16:07:12] (03PS3) 10Filippo Giunchedi: RESTBase Cassandra: Lower compaction throughput to 20MB/s [puppet] - 10https://gerrit.wikimedia.org/r/300056 (https://phabricator.wikimedia.org/T140825) (owner: 10Eevans) [16:07:19] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] RESTBase Cassandra: Lower compaction throughput to 20MB/s [puppet] - 10https://gerrit.wikimedia.org/r/300056 (https://phabricator.wikimedia.org/T140825) (owner: 10Eevans) [16:07:21] godog: r300056 is already applied ephemerally everywhere [16:07:29] so it just makes sure it doesn't change back on a restart [16:07:50] urandom: ah, ok thanks! 
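"Already applied ephemerally everywhere" presumably means the lower compaction throughput was set at runtime via nodetool, so the puppet change only persists it across restarts; a sketch of that runtime step (an assumption based on the conversation, run per node):

```shell
# Runtime-only change: takes effect immediately but is lost on restart,
# which is why the matching cassandra.yaml/puppet change is still needed.
nodetool setcompactionthroughput 20

# Confirm the live value.
nodetool getcompactionthroughput
```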
[16:08:15] godog: r300059 is going to require restarts, but given the issues with that, i'll probably do it selectively at first [16:08:29] it only affects streaming though, so it's not something that would bring down the cluster [16:08:45] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [16:09:57] (03CR) 10BBlack: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) (owner: 10Ema) [16:10:13] 06Operations, 10ops-eqiad, 10DBA: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2483793 (10jcrespo) p:05Triage>03Normal dbproxy1002 seems to be back up again thanks to @fgiunchedi and @Joe. I will point the DNS back to the proxy again at an appropriate window. [16:10:15] urandom: indeed, so to be sure, that means timeout: 0 across the board [16:10:23] yeah [16:10:29] which is what it was in 2.1, fwiw [16:10:42] ok! [16:10:48] (03PS2) 10Filippo Giunchedi: Disable `streaming_socket_timeout_in_ms` setting [puppet] - 10https://gerrit.wikimedia.org/r/300059 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [16:10:59] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Disable `streaming_socket_timeout_in_ms` setting [puppet] - 10https://gerrit.wikimedia.org/r/300059 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [16:11:26] godog: thanks! [16:11:32] (03CR) 10RobH: [C: 031] Changed partition scheme for relforge (elasticsearch) servers [puppet] - 10https://gerrit.wikimedia.org/r/300286 (https://phabricator.wikimedia.org/T137256) (owner: 10Gehel) [16:11:33] bblack: thx again!! [16:11:43] urandom: np! [16:12:37] <_joe_> subbu: so let's try a deploy? [16:12:42] sure.
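The r300059 change amounts to a one-line cassandra.yaml setting; a sketch of the relevant fragment ("timeout: 0 across the board", matching the 2.1 behaviour mentioned above):

```yaml
# cassandra.yaml fragment: 0 disables the per-socket streaming timeout, so
# long-running streams (bootstrap, repair, decommission) are never killed
# mid-transfer; only streaming is affected, not client traffic.
streaming_socket_timeout_in_ms: 0
```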
[16:12:51] let me get onto tin [16:13:15] (03CR) 10Thcipriani: "Puppet compiler output: https://puppet-compiler.wmflabs.org/3426/" [puppet] - 10https://gerrit.wikimedia.org/r/300175 (owner: 10Thcipriani) [16:13:39] !log starting parsoid deployment [16:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:53] thcipriani: you're up [16:14:12] godog: okie doke [16:14:27] (puppet compiler in the nick of time) [16:14:42] 06Operations, 10ops-eqiad, 10netops: Upgrade cr1/cr2-eqiad JunOS - https://phabricator.wikimedia.org/T140770#2483833 (10faidon) [16:14:49] (also, I confess, for some time we did facilitate banners doing this in a couple ways, for the purpose of helping people limit banners shown... but we didn't consider the cookie consequences. The cookies created like this, i.e., with re-purposed JS from FR-banners, and also from a briefly-deployed feature, are the ones where there are pairs with one ending in "-wait".) [16:15:12] _joe 44/45 minions completed fetch ... [16:15:15] (bblack: ^) [16:15:20] so, the 45th minion is ruthenium? [16:15:24] <_joe_> subbu: sigh, ruthenium? [16:15:30] <_joe_> I damn removed it [16:15:36] yup. ruthenium [16:15:37] ruthenium.eqiad.wmnet: [16:15:37] fetch status: None [started: 1 mins ago, last-return: None mins ago] [16:15:53] so, should i continue or abort? [16:15:56] (03PS21) 10Filippo Giunchedi: Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [16:15:58] have to remove it from the redis instance on tin to make it go away. [16:16:02] <_joe_> abort I guess [16:16:04] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [16:16:05] <_joe_> thcipriani: I did [16:16:17] _joe_, aborted. [16:16:29] oh, weird. 
I thought you just meant removed the target from the instance grains [16:16:51] (03CR) 10GWicke: "@eevans: We have been running on significantly lower trickle fsync intervals before, and only increased it as a larger interval was still " [puppet] - 10https://gerrit.wikimedia.org/r/300100 (https://phabricator.wikimedia.org/T140825) (owner: 10GWicke) [16:16:58] !log aborted (test) parsoid deployment [16:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:33] <_joe_> subbu: let me inspect this again [16:17:36] k [16:17:40] 06Operations, 10ops-eqiad, 10netops: cr1/cr2-eqiad: install new SCBs and linecards - https://phabricator.wikimedia.org/T140764#2483845 (10faidon) [16:17:42] 06Operations, 10ops-eqiad, 10netops: Replace cr1/2-eqiad PSUs/fantrays with high-capacity ones - https://phabricator.wikimedia.org/T140765#2483842 (10faidon) 05Resolved>03Open @cmjohnson, if I recall correctly, you swapped cr2's fantray with the new one but not cr1's, since they were the exact same model... [16:18:11] (03PS1) 10Chad: Last wikis to wmf.11 (group2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300298 [16:18:28] <_joe_> thcipriani: is there a written procedure of how to remove a minion from trebuchet? 
[16:18:58] IIRC it was on wikitech, involving redis [16:19:11] AndyRussG: I'm taking a 1h sample now, will report back later [16:19:26] <_joe_> godog: I removed the minion from the list in redis yesterday [16:19:27] (03PS2) 10Filippo Giunchedi: Prerequisites for logstash_checker use [puppet] - 10https://gerrit.wikimedia.org/r/300175 (owner: 10Thcipriani) [16:19:31] <_joe_> but it was back now [16:19:38] (03PS3) 10Gehel: Changed partition scheme for relforge (elasticsearch) servers [puppet] - 10https://gerrit.wikimedia.org/r/300286 (https://phabricator.wikimedia.org/T137256) [16:19:43] _joe_: https://phabricator.wikimedia.org/T132182 [16:19:51] _joe_: err https://wikitech.wikimedia.org/wiki/Trebuchet#Removing_minions_from_redis [16:19:54] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2483848 (10jcrespo) a:05jcrespo>03None I do not know why this is assigned to me, these requests should be handled by https://wikitech.wikimedia.org/wiki/O... [16:20:31] <_joe_> hashar: what i did exactly... [16:20:34] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "merging, though logstash_checker should be moved to service_checker package" [puppet] - 10https://gerrit.wikimedia.org/r/300175 (owner: 10Thcipriani) [16:20:45] 06Operations, 10ops-eqiad, 10DBA: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2483855 (10jcrespo) a:03jcrespo [16:21:05] _joe_: there was this: https://wikitech.wikimedia.org/wiki/Trebuchet#Removing_minions_from_redis but that doesn't capture the whole process, just the reporting. You'll also need to remove the grain from the instance otherwise trebuchet will try to use it again.
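The two-part cleanup thcipriani describes might look roughly like this; the redis key layout and the deploy-repo name are assumptions for illustration, not a verified procedure:

```shell
# On the deployment server: drop the stale host from the redis set that
# Trebuchet uses for deploy reporting (key name is a guess).
redis-cli SREM 'deploy:parsoid/deploy:minions' 'ruthenium.eqiad.wmnet'

# On the salt master: remove the repo from the host's deployment_target
# grain, otherwise Trebuchet will pick the minion up again on the next
# deploy (the "magically added back" behaviour seen here).
salt 'ruthenium*' grains.remove deployment_target 'parsoid/deploy'
```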
[16:21:22] * thcipriani looks up the name of the grain [16:21:26] _joe_: so I guess they are magically added back again due to a puppet deployment::target that is leftover (pure speculation) [16:22:08] (03PS1) 10BBlack: puppet @var warning on fastopen_pending_max [puppet] - 10https://gerrit.wikimedia.org/r/300302 [16:22:09] ah the deployment_target: grain on the instance will have an array of things deployed to it [16:22:50] hashar: where did you see https://gerrit.wikimedia.org/r/#/c/298568/2 failing btw? [16:22:56] failing as in, not working as expected [16:23:43] <_joe_> subbu: try now? [16:23:49] ok .. [16:23:55] (03PS1) 10Chad: Gerrit: Disable downloading of archives [puppet] - 10https://gerrit.wikimedia.org/r/300304 [16:24:03] !log starting (test) parsoid deployment [16:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:25] godog: thank you for the merges :) [16:25:05] (03CR) 10Paladox: [C: 031] "We can directly link GitHub in phabricator :)" [puppet] - 10https://gerrit.wikimedia.org/r/300304 (owner: 10Chad) [16:25:21] thcipriani: np, do you think it could be moved out of puppet anytime soon? [16:25:29] _joe_, 44/44 now .. so whatever you did worked. [16:25:36] continuing. 
[16:25:39] <_joe_> actually I stated pretty clearly I wanted that not merged [16:25:51] <_joe_> as it should've been moved to service-checker [16:26:05] <_joe_> but well, now I'll have to do another transition, it's ok though [16:26:11] (03PS4) 10Paladox: phab: only mirror refs/heads/ and ./tags/ for mwcore and ops/puppet [puppet] - 10https://gerrit.wikimedia.org/r/295011 [16:26:18] <_joe_> this isn't used in nagios, so it's simpler [16:26:31] <_joe_> we also have no tests for it [16:26:35] <_joe_> which is unfortunate [16:26:55] <_joe_> anyways, whatever, it's very late (again) and I have to go in 10 minutes [16:27:20] _joe_: heh I haven't seen your do not merge comment [16:27:34] <_joe_> godog: I think there was no comment on the patch actually [16:27:41] <_joe_> my bad [16:27:50] <_joe_> that's why i am not complaining with you :) [16:27:51] ack, it can be moved. I really want to get something in place to catch terrible deploys before they hit production very soon, hence the sudden movement [16:27:56] !log synced parsoid code; restarting parsoid on wtp1001 as a canary [16:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:02] (03PS4) 10Gehel: Changed partition scheme for relforge (elasticsearch) servers [puppet] - 10https://gerrit.wikimedia.org/r/300286 (https://phabricator.wikimedia.org/T137256) [16:28:05] ok but yeah not a huge deal [16:28:13] !log Cancelling 2003-c bootstrap, and disabling Puppet on restbase2003.codfw.wmnet to keep instance down : T134016 [16:28:14] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:28:14] wtp1001 looking good .. restarting parsoid all nodes. [16:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:27] ottomata: could I get you to take a look at https://phabricator.wikimedia.org/T140342#2480251 ? :) [16:29:32] <_joe_> subbu: ack, I am going off then [16:29:52] _joe_, thanks. 
looks good. [16:30:22] !log finished (test) deploy of parsoid sha ed2f8228 [16:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:19] (03Abandoned) 10Paladox: phab: only mirror refs/heads/ and ./tags/ for mwcore and ops/puppet [puppet] - 10https://gerrit.wikimedia.org/r/295011 (owner: 10Paladox) [16:32:12] hashar: I'd like more reviews on https://gerrit.wikimedia.org/r/#/c/276346, we can talk about https://gerrit.wikimedia.org/r/#/c/298568/ tomorrow too [16:33:05] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10MediaWiki-Search: Italian Wikivoyage search feature doesn't work with non existing pages - https://phabricator.wikimedia.org/T140982#2483898 (10Gehel) a:03EBernhardson This seems to be related to interwiki search. @EBernhardson has a patch already,... [16:33:28] (03CR) 10Ema: [C: 031] puppet @var warning on fastopen_pending_max [puppet] - 10https://gerrit.wikimedia.org/r/300302 (owner: 10BBlack) [16:33:42] (03CR) 10Gehel: [C: 032] Changed partition scheme for relforge (elasticsearch) servers [puppet] - 10https://gerrit.wikimedia.org/r/300286 (https://phabricator.wikimedia.org/T137256) (owner: 10Gehel) [16:33:52] (03PS3) 10Filippo Giunchedi: contint: APPEND unattended upgrade allowed-origins [puppet] - 10https://gerrit.wikimedia.org/r/298568 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar) [16:33:58] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] contint: APPEND unattended upgrade allowed-origins [puppet] - 10https://gerrit.wikimedia.org/r/298568 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar) [16:34:06] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10MediaWiki-Search: Italian Wikivoyage search feature doesn't work with non existing pages - https://phabricator.wikimedia.org/T140982#2483910 (10EBernhardson) [16:34:22] PROBLEM - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is CRITICAL: Connection refused [16:34:22] 06Operations, 10Ops-Access-Requests, 
06WMDE-Analytics-Engineering, 13Patch-For-Review, 15User-Addshore: Requesting sudo access to analytics-wmde user on stat1002 for Addshore - https://phabricator.wikimedia.org/T140342#2483912 (10Ottomata) Naw, this is totally fine. `analytics-wmde` is a user we created... [16:34:23] hashar: nevermind, I thought https://gerrit.wikimedia.org/r/#/c/298568/ was global not only contint, merged [16:34:30] got that ^^^^ [16:34:41] godog: sorry, I just merged a patch during your window... [16:35:04] gehel: np, 99% of cases patches are ok to puppet-merge [16:35:12] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is CRITICAL: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2016-07-23 16:34:57. [16:35:20] (03PS1) 10Giuseppe Lavagetto: puppetmaster: temporarily allow rhodium to compile all catalogs [puppet] - 10https://gerrit.wikimedia.org/r/300307 (https://phabricator.wikimedia.org/T98173) [16:35:25] godog: should I merge mine and yours together? [16:35:56] gehel: yeah go for it, can't merge separately I think [16:36:05] godog: done [16:36:52] !log T134016: Restarting Cassandra to apply new stream timeout (restbase2003-a.codfw.wmnet) [16:36:53] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:08] 06Operations, 07Puppet, 13Patch-For-Review, 05Puppet-infrastructure-modernization: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#2483918 (10Joe) So, rhodium can now successfully compile its own catalog through the puppetmaster infrastructure (a... 
[16:38:35] !log T134016: Restarting Cassandra to apply new stream timeout (restbase2003-b.codfw.wmnet) [16:38:36] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:22] PROBLEM - cassandra-c service on restbase2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [16:39:32] mine ^^^ [16:40:36] ACKNOWLEDGEMENT - cassandra-c service on restbase2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed eevans Maintenance (T134016), be back soon. - The acknowledgement expires at: 2016-07-22 16:39:58. [16:40:58] 06Operations, 10ops-eqiad, 10netops: Upgrade cr1/cr2-eqiad JunOS - https://phabricator.wikimedia.org/T140770#2483946 (10faidon) OK, today we upgraded JunOS on cr2-eqiad to 13.3R9, as well as swapped the SCBs with new ones. The JunOS upgrade all generally worked without many issues and took about ~2hrs. The... [16:41:15] !log T134016: Restarting Cassandra to apply new stream timeout (restbase200r-a.codfw.wmnet) [16:41:16] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:42:18] 06Operations, 10ops-eqiad, 10netops: cr1/cr2-eqiad: install new SCBs and linecards - https://phabricator.wikimedia.org/T140764#2483971 (10faidon) cr2's SCBs were upgraded today, which didn't go very smoothly for various reasons. T140770 has the full writeup. cr2 still doesn't have the new linecard install,... [16:42:52] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:42:53] * jynus fixes icinga check. 
/me realizes on merge conflict that someone had already sent a patch for it :-( [16:43:12] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:43:21] (03PS5) 10Ema: cache_upload VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) [16:43:36] !log T134016: Restarting Cassandra to apply new stream timeout (restbase2004-b.codfw.wmnet) [16:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:03] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:44:04] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2480587 (10Gehel) The NDA group grants access to grafana-admin and a [[ https://wikitech.wikimedia.org/wiki/LDAP_Groups | few more things ]]. If @Jonas has al... [16:44:12] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [16:44:17] 06Operations, 10Parsoid, 06Services, 15User-mobrovac: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2483976 (10ssastry) 05stalled>03Open [16:44:36] 06Operations, 06Services: Move all Node.JS services to Jessie and Node 4 - https://phabricator.wikimedia.org/T124989#2483977 (10ssastry) [16:46:09] !log T134016: Restarting Cassandra to apply new stream timeout (restbase2008-a.codfw.wmnet) [16:46:10] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:51] !log ebernhardson@tin Synchronized 
php-1.28.0-wmf.11/extensions/CirrusSearch/includes/Searcher.php: T140950: Deploy UBN fix to CirrusSearch (duration: 00m 31s) [16:46:52] T140950: Undefined property: CirrusSearch\InterwikiSearcher::$searchContext - https://phabricator.wikimedia.org/T140950 [16:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:47:38] !log T134016: Restarting Cassandra to apply new stream timeout (restbase2008-b.codfw.wmnet) [16:47:38] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:48:11] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [16:49:47] (03PS6) 10Ema: cache_upload VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) [16:50:34] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2484019 (10Gehel) p:05Triage>03Normal [16:50:51] !log T134016: Restart of codfw rack 'c' instances to apply stream socket timeout complete [16:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:10] chasemp YuviPanda andrewbogott the carbon-cache too many creates were from rabbitmq for labs, not a problem though just FYI [16:51:12] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:51:32] !log T134016: Starting bootstrap of restbase2003-c.codfw.wmnet [16:51:33] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:51:34] godog: I don't know what that means; is it just because I restarted it too many times in a row?
[16:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:58] (03CR) 10Krinkle: "it seems foundation: it still protocol-relative (but not wikimedia: and wmf:), is that intentional?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298696 (owner: 10Legoktm) [16:52:39] YuviPanda: to answer your question, I played around with sth like your tool dropdown in https://prometheus.wmflabs.org/grafana/dashboard/db/http-s-tcp-probes-drilldown [16:53:26] YuviPanda: so the instance name is a query to prometheus to auto-fill it based on what's there [16:53:52] RECOVERY - cassandra-c service on restbase2003 is OK: OK - cassandra-c is active [16:53:55] (03PS3) 10Yuvipanda: cold-migrate: use novaenv.sh for credentials [puppet] - 10https://gerrit.wikimedia.org/r/299602 (https://phabricator.wikimedia.org/T139272) (owner: 10Andrew Bogott) [16:54:08] (03PS4) 10Yuvipanda: cold-migrate: use novaenv.sh for credentials [puppet] - 10https://gerrit.wikimedia.org/r/299602 (https://phabricator.wikimedia.org/T139272) (owner: 10Andrew Bogott) [16:54:31] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [16:54:33] (03PS7) 10Ema: cache_upload VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) [16:54:39] (03CR) 10Yuvipanda: [C: 032 V: 032] cold-migrate: use novaenv.sh for credentials [puppet] - 10https://gerrit.wikimedia.org/r/299602 (https://phabricator.wikimedia.org/T139272) (owner: 10Andrew Bogott) [16:54:49] (03PS2) 10Yuvipanda: cold-migrate: activate/deactivate base image as needed. [puppet] - 10https://gerrit.wikimedia.org/r/299661 (https://phabricator.wikimedia.org/T139272) (owner: 10Andrew Bogott) [16:55:07] (03CR) 10Yuvipanda: [C: 032 V: 032] cold-migrate: activate/deactivate base image as needed. 
[puppet] - 10https://gerrit.wikimedia.org/r/299661 (https://phabricator.wikimedia.org/T139272) (owner: 10Andrew Bogott) [16:55:11] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [16:55:32] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160721T1700). [17:01:58] (03CR) 10Paladox: "@Krinkle hi, I didn't do the logo's since it was late and I'm not sure how to." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [17:03:41] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance, 13Patch-For-Review: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2484106 (10Dereckson) [17:04:40] jenkins/gerrit seems to be having problems. known? [17:05:28] 06Operations, 10Cassandra, 06Services, 10hardware-requests: 6x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#2448119 (10RobH) [17:05:54] gerrit gave me an error when submitting a review comment ("Code Review - Error \n Server Unavailable \n 0") [17:06:18] cscott: did it persist? [17:06:24] and jenkins jobs are failing trying to clone from gerrit [17:06:30] greg-g: yes, still won't submit [17:06:39] did apergos file that task last night? [17:06:47] greg-g gerrit is slow for me too [17:06:49] 06Operations, 10Parsoid, 06Services, 15User-mobrovac: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2484158 (10ssastry) Okay, 2 months later, we are now ready to pick this up again. @akosiaris @mobrovac .. is first week of August a good time to pick this up again? 
[17:06:55] but i think i know why [17:07:10] 07:47 < apergos> !log restarted gerrit on ytterbium, it was refusing to complete git fetches for large repos (mw core, puppet...) [17:07:20] !log cleaning leftover crons on logstash* servers - T140973 [17:07:21] T140973: logstash - cron failing to optimize indices - https://phabricator.wikimedia.org/T140973 [17:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:07:28] it happened last night and a.pergos restarted it to fix it [17:07:42] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance, 13Patch-For-Review: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2479747 (10demon) Why on earth... [17:07:48] ostriches is pushing refs/changes/ for mw-core and operations/puppet to github mirror. [17:07:54] yes, gerrit is still here but slow [17:08:08] that could be it, but then why did it happen last night at midnight pacific? [17:08:34] Not sure though. [17:08:37] midnight.. sounds like cron [17:08:43] midnight ish [17:08:48] That's not it. [17:08:48] * greg-g looks at his logs [17:08:52] And I'm not running that right now [17:08:56] Oh [17:08:56] https://integration.wikimedia.org/ci/job/mediawiki-phpunit-hhvm-composer/5154/console is a recent failure-to-clone. [17:09:20] yeah gerrit is slow alright [17:09:27] gerrit http unresponsive [17:09:31] 17:01:59 git.exc.GitCommandError: 'git remote update origin' returned with exit code 1 [17:09:36] It failed to connect. [17:09:39] (03CR) 10Paladox: "@MarcoAurelio there is no such file called that, but there is something called visualeditor-nondefault.dblist but it disables visualeditor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [17:09:42] Ugh. 
[17:09:49] 17:01:59 git.exc.GitCommandError: 'git remote update origin' returned with exit code 1 [17:09:49] mutante: it wasn't on the hour mark, both before and after, afaict [17:09:51] * ostriches puts on his workin' hat. [17:09:52] 17:01:59 stderr: 'error: RPC failed; result=22, HTTP code = 503 [17:09:55] yeah [17:09:57] oops [17:09:58] What gives. [17:10:01] see eg from last night: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/113762/console [17:10:11] is it the regular "users exhausting connections" problem? [17:10:17] btw gerrit has been bounced earlier today because of dbproxy1002, if that's relevant [17:10:30] I have never seen that happen before with gerrit. [17:10:39] Maybe someone could be attacking gerrit [17:11:22] ostriches: i am checking the mgmt console now [17:11:59] yes, the queue is full: https://wikitech.wikimedia.org/wiki/Gerrit#Tasks_management [17:12:09] should I start killing jobs? [17:13:04] No. [17:13:06] Please don't. [17:13:10] ok [17:13:15] that is why I asked [17:13:24] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:13:26] !log gerrit: killed a couple of long-running git-upload-pack's for mediawiki/core [17:13:28] i cant login. want me to reboot ytterbium? [17:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:13:34] I'm already logged in fine [17:13:37] ok [17:13:44] PROBLEM - Unmerged changes on repository puppet on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:14:00] Looks like Nikerabbit's observation this morning in #wikimedia-releng had a reason :) [17:14:14] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:14:24] Bleh. 
[17:14:33] gerrit is back [17:14:34] now [17:14:37] Works fine for me [17:14:46] Oh wait [17:14:51] the blue background is gone [17:14:57] with logo [17:15:02] !log gerrit: restarting [17:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:14] That's dumb gerrit. [17:15:25] Why you explode on a bunch of git-upload-packs? [17:15:31] Oh ha [17:15:46] Maybe it is fixed in gerrit 2.12. so hopefully this problem won't happen [17:15:47] again [17:15:49] I'm now getting 503s instead of timeouts [17:15:50] after the upgrade [17:15:57] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance, 13Patch-For-Review: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2479747 (10AlexMonk-WMF) Histor... [17:16:00] Gerrit works for me now [17:16:02] RoanKattouw: Because I just restarted it :p [17:16:03] paladox, it was restarting [17:16:06] it's back up [17:16:09] Oh, thanks [17:16:13] RoanKattouw: it takes a bit, try again in some secs [17:16:13] Yup WFM now [17:16:15] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 10Wikimedia-Logstash: logstash - cron failing to optimize indices - https://phabricator.wikimedia.org/T140973#2484247 (10Gehel) Old crons have been cleaned. Let's wait a bit to see if we have other errors before closing this. [17:16:46] I'm going to watch the task queue for awhile [17:16:51] And dig and see wtf set this off. [17:16:52] * greg-g really wanted to put "Status: Nominal" [17:16:52] maintenance mode for gerrit is being worked on, btw [17:17:14] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[17:17:15] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [17:17:24] RECOVERY - Unmerged changes on repository puppet on labcontrol1002 is OK: No changes to merge. [17:17:42] mutante: The maint_mode works, but only when we explicitly turn it on. [17:17:44] RECOVERY - Unmerged changes on repository puppet on labcontrol1001 is OK: No changes to merge. [17:17:48] When it breaks, it still breaks :) [17:18:00] 06Operations, 10Ops-Access-Requests, 10EventBus, 06Services: Allow the Services team to administer the eventbus services - https://phabricator.wikimedia.org/T141013#2484278 (10mobrovac) [17:18:57] 06Operations, 10Ops-Access-Requests, 10EventBus, 06Services: Allow the Services team to administer the eventbus services - https://phabricator.wikimedia.org/T141013#2484297 (10mobrovac) @GWicke please approve as the services team manager, @Nuria please approve as the analytics team manager (the team owning... [17:19:46] greg-g: no, it was not a thing for a task [17:20:07] ostriches: right :) [17:20:08] gerrit was broken in a really weird way... anyways a kick made it happy again [17:20:27] apergos: yeah, forgot you were just going to kick it, it just happened again (most likely same cause?) [17:20:41] is it extension-dist? [17:20:46] ostriches: [17:20:46] No [17:20:48] huh [17:21:00] then dunno [17:21:23] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [17:21:23] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [17:21:27] I'm not entirely sure what happened. Same issue that happened the other day that Antoine noticed [17:21:38] Underlying cause still unclear. [17:21:46] bummer [17:21:54] maybe we do need a task so we can collect info [17:22:00] Symptoms: a few git-upload-pack start getting stuck. [17:22:05] Others pile up. [17:22:09] Queue gets unmanageable.
[17:22:10] how can you tell they are stuck? [17:22:13] Gerrit gets wobble. [17:22:21] ostriches: do you know if there's any sort of metrics pushed associated with gerrit's jvm? [17:22:24] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [17:22:41] which is to say, how could I, not a gerrit expert, tell they are stuck? [17:22:46] apergos: Eg from `gerrit show-queue -w`: [17:22:48] 187b18f2 15:53:55.868 git-upload-pack p/mediawiki/core.git [17:22:51] ah ha [17:22:56] You'll see a few just sitting there. [17:23:01] gotcha [17:23:03] maybe we could implement a watchdog and solve it "by force" [17:23:07] And the rest don't have a start time and are just "waiting....." [17:23:27] if you create pileups/take too much time, kill [17:23:29] jynus: Doable. Or at least scriptable to do semi-automatically. [17:23:52] (if we want human involvement) [17:23:52] it doesn't even have to be a fully automatic thing [17:23:55] 06Operations, 10Parsoid, 06Services, 15User-mobrovac: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2484308 (10mobrovac) Looping @Joe and @Dzahn in too. August works for me. @Joe, @akosiaris, @Dzahn ? The plan is the following. Convert 2 to 3 machines in eqiad a... [17:24:00] it can be an icinga check [17:24:01] Yeah, something like "You see it doing X, run Y" [17:24:02] :) [17:24:03] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [17:24:15] Er, icinga alerts to X, someone look, probably run Y [17:24:16] "more than X commands queued on gerrit" [17:24:20] yes [17:25:24] godog: Re jvm stats, no. Monitoring for gerrit is pretty old/rudimentary. Could possibly reuse some of the stuff we use on Elastic. [17:25:27] For basic JVM stuff. [17:25:31] oohhh [17:25:40] good that would save a bunch of digging around in traces and such [17:25:47] we can execute any script based on an icinga alert. 
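The "scriptable to do semi-automatically" idea above could start with something like the following sketch, which flags stuck entries in `gerrit show-queue -w` output. The line shape is assumed from the snippet quoted on-channel (`187b18f2 15:53:55.868 git-upload-pack p/mediawiki/core.git`), with queued-but-not-started entries showing `waiting ....` in the start-time column; the 15-minute cutoff is an illustrative choice, not anything agreed in the discussion.

```python
import datetime

def find_stuck(queue_lines, now, max_age=datetime.timedelta(minutes=15)):
    """Return task ids of entries that started more than max_age ago.

    Assumes the `gerrit show-queue -w` line shape quoted above:
    '<task-id> <HH:MM:SS.mmm> <command> <args>'. Entries still waiting
    to start carry 'waiting ....' instead of a start time and are skipped.
    """
    stuck = []
    for line in queue_lines:
        parts = line.split()
        if len(parts) < 3 or parts[1].startswith('waiting'):
            continue
        # show-queue only prints a time of day, so attach today's date...
        started = datetime.datetime.combine(
            now.date(),
            datetime.datetime.strptime(parts[1], '%H:%M:%S.%f').time())
        # ...and treat a "future" start time as yesterday's.
        if started > now:
            started -= datetime.timedelta(days=1)
        if now - started > max_age:
            stuck.append(parts[0])
    return stuck
```

A human (or a cron job) could then feed the stuck ids to `gerrit kill <task-id>`, which matches the "You see it doing X, run Y" pattern discussed.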
possible in icinga. but we'd have to be _really_ sure that fully automatic is a good idea [17:26:03] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5099119 keys - replication_delay is 0 [17:26:08] At the /bare minimum/ let's add an icinga check for a queue longer than like 50. [17:26:11] probably a bad idea to start adopting that pattern (icinga -> fixscript) [17:26:12] ostriches: yep also I think would be useful to get jvm stats in graphite, for that I think we use jmxtrans with hadoop [17:26:15] Anything longer is definitely a problem. [17:26:22] I am all for alert + documentation unless it is too frequent (it is not) [17:26:29] (Probably could go shorter, but I'm afraid of false positives during bot actions like translatewiki) [17:26:38] !log krinkle@tin Synchronized w/static.php: allow short-lived caching of 400/500 errors (duration: 00m 24s) [17:26:38] 06Operations, 10Parsoid, 06Services, 15User-mobrovac: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2484328 (10greg) [17:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:27:18] Ok, now how to expose that metric nicely... [17:27:32] that's two getting stuck in the same workday for me [17:28:07] apergos, there was a small outage before of m2 [17:28:14] show-queue is only over gerrit ssh. Which is A) Annoying because keypairs + access, and B) Not useful at all if SSH itself is lagging. [17:28:23] Wonder if it's in the RPC api. [17:28:25] yes this was ssh only all right [17:28:32] _j oe_ figured that out [17:28:41] And of course the .war file doesn't expose it [17:28:48] So can't just use java from cli. [17:28:51] jynus: when was the m2 outage? [17:29:03] of course you can't because that would make it easy, ostriches [17:29:07] god forbid [17:29:41] Nope, no useful endpoint. 
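The "queue longer than like 50" icinga check proposed above might look like this hypothetical plugin core. It assumes the check receives the text of `gerrit show-queue -w` (fetched over SSH, which is the access problem noted in the discussion), one task per line plus a trailing summary line such as `  2 tasks`; the 25/50 thresholds mirror the numbers floated on-channel.

```python
def check_queue_length(show_queue_output, warn=25, crit=50):
    """Nagios-style (exit code, message) for gerrit's task queue length.

    `show_queue_output` is assumed to be the raw text of
    `gerrit show-queue -w`; the trailing '  N tasks' summary line
    is ignored so only real queue entries are counted.
    """
    tasks = [l for l in show_queue_output.splitlines()
             if l.strip() and not l.strip().endswith('tasks')]
    n = len(tasks)
    if n >= crit:
        return 2, 'CRITICAL - %d tasks queued on gerrit' % n
    if n >= warn:
        return 1, 'WARNING - %d tasks queued on gerrit' % n
    return 0, 'OK - %d tasks queued on gerrit' % n
```

The bot-burst caveat raised on-channel (translatewiki pushes) argues for keeping the critical threshold generous at first and tightening it once false-positive behavior is understood.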
[17:29:42] :) [17:29:43] we could just focus on "it gets slow" like the humans detected it too [17:29:54] ostriches: Hmm now it seems that grrrit-wm doesn't work any more? [17:30:06] apergos, around here: https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?panelId=17&fullscreen&from=1469107734548&to=1469110113985 [17:30:07] you can see that there are git fetches stacked up on palladium [17:30:29] I noticed that there is something that fires once a minute and it was stacking up, eventually the first ones would time out [17:30:35] but that's pretty iffy to try to grab [17:30:37] RoanKattouw: that happens on every gerrit restart. [17:30:37] it's not like 50 [17:30:51] RoanKattouw: let me kick the bot [17:30:59] Thanks [17:31:18] that correlates to nothing I know of. hmm [17:31:48] RoanKattouw: It never works after a gerrit restart. [17:31:58] I have an outstanding offer of $20 to anyone who can make it auto-restart :p [17:32:21] oh wait, jynus was that the disk wipe? or am I misremembering? [17:32:56] pod "grrrit-wm-230500525-h411u" deleted [17:32:56] apergos, the link I sent you was misleading [17:32:59] ^ kubernetes [17:33:00] that was something else [17:33:08] !log restarted grrrit-wm [17:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:33:55] ETOOMANYISSUES [17:34:30] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance, 13Patch-For-Review: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2484392 (10demon) I'd prefer ju... [17:35:13] jynus: did it make it into a ticket someplace or not worth the bother? (if it did I'll follow along) [17:35:27] apergos, dbproxy1002?
[17:35:47] yes the thing you were linking [17:35:55] but is misleading [17:36:04] gerrit was down between 12:43 and 12:58 [17:36:23] apergos, https://phabricator.wikimedia.org/T140983 [17:36:29] thanks [17:36:50] ok I remember this happening at the same time as toomanyissues [17:36:52] thanks [17:36:53] (03CR) 10Dzahn: "@mobrovac should i just merge it anytime?" [puppet] - 10https://gerrit.wikimedia.org/r/300214 (https://phabricator.wikimedia.org/T140898) (owner: 10Dzahn) [17:37:10] 06Operations, 10Parsoid, 06Services, 15User-mobrovac: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2290725 (10greg) Add that T125003 subtask, but that might be the wrong one. Basically, we need to make sure Beta Cluster is updated before. [17:37:29] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2484399 (10Pavanaja) >>! In T140898#2482774, @Dzahn wrote: > copying verbatim comment from @Glaisher on T134017#2253719 > > --- > > Could someone prov... [17:39:18] 06Operations, 10ops-eqiad, 10DBA: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2484406 (10jcrespo) I am checking times, according to logs (request numbers are too low) gerrit and OTRS were down between 12:43 and 12:58. [17:39:27] 06Operations, 10Ops-Access-Requests, 10EventBus, 06Services: Allow the Services team to administer the eventbus services - https://phabricator.wikimedia.org/T141013#2484407 (10GWicke) Approved. [17:43:19] (03CR) 10Dzahn: "translations for namespaces have been provided now on https://phabricator.wikimedia.org/T140898#2484399 should that be also included here " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [17:45:26] (03CR) 10Mobrovac: "As soon as tcy.wikipedia.org is up and kicking, yes. If that happens today, you can coordinate with Gabriel, Petr or Eric E. 
to restart RB" [puppet] - 10https://gerrit.wikimedia.org/r/300214 (https://phabricator.wikimedia.org/T140898) (owner: 10Dzahn) [17:45:41] 06Operations, 10Flow, 10MediaWiki-Redirects, 03Collab-Team-Q1-July-Sep-2016, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#2484432 (10jmatazzoni) [17:50:23] 06Operations, 10VisualEditor, 07Performance: fix puppet run on osmium (by either providing or removing chromium package on jessie) - https://phabricator.wikimedia.org/T141023#2484480 (10Dzahn) [17:52:30] ACKNOWLEDGEMENT - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn https://phabricator.wikimedia.org/T141023 [17:53:23] 06Operations, 10VisualEditor, 07Performance: fix puppet run on osmium (by either providing or removing chromium package on jessie) - https://phabricator.wikimedia.org/T141023#2484498 (10Dzahn) [17:54:07] 06Operations, 10VisualEditor, 07Performance: fix puppet run on osmium (remove or provide chromium-browser package on jessie) - https://phabricator.wikimedia.org/T141023#2484480 (10Dzahn) [17:57:06] (03CR) 10EBernhardson: [C: 031] Labs: remove wmgCirrusSearchUseCompletionSuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300185 (owner: 10MaxSem) [18:01:19] RECOVERY - Ensure legal html en.m.wp on en.m.wikipedia.org is OK: all html is present. [18:04:29] 06Operations, 10Ops-Access-Requests, 10EventBus, 06Services: Allow the Services team to administer the eventbus services - https://phabricator.wikimedia.org/T141013#2484539 (10Nuria) @mobrovac . Approved on my end. Note that analytics folks (devs, not devops) also need permits, this includes: @mforns , @Mi... 
[18:05:43] 06Operations, 03Discovery-Search-Sprint, 13Patch-For-Review: Elasticsearch index indexing slow log generates too much data - https://phabricator.wikimedia.org/T117181#2484569 (10debt) 05Open>03Resolved resolving this one (was still open but in the resolved column) [18:09:33] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 2 others: Only use newer (elastic10{16..47}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#2484587 (10debt) 05Open>03Resolved [18:11:57] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint, 13Patch-For-Review: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2484591 (10debt) [18:12:00] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, and 2 others: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2484590 (10debt) 05Open>03Resolved [18:12:52] 06Operations, 10Monitoring, 06Release-Engineering-Team: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2484595 (10ori) p:05Low>03High >>! In T140942#2483038, @Gehel wrote: > Triaging this as low priority to match T117470. No, this should definitely have a higher... 
[18:13:03] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Logstash elasticsearch mapping does not allow err.code to be a string - https://phabricator.wikimedia.org/T137400#2484612 (10debt) [18:13:05] 06Operations, 06Discovery, 10Wikimedia-Logstash, 03Discovery-Search-Sprint, and 2 others: [EPIC] Upgrade elasticsearch cluster supporting logging to 2.3 - https://phabricator.wikimedia.org/T136001#2484609 (10debt) 05Open>03Resolved a:03debt [18:15:32] (03PS1) 10Dzahn: jsbench: chromium-browser on trusty, chromium on jessie [puppet] - 10https://gerrit.wikimedia.org/r/300321 (https://phabricator.wikimedia.org/T141023) [18:17:41] (03PS2) 10Dzahn: jsbench: chromium-browser on trusty, chromium on jessie [puppet] - 10https://gerrit.wikimedia.org/r/300321 (https://phabricator.wikimedia.org/T141023) [18:19:19] (03CR) 10Ori.livneh: [C: 031] "tnx" [puppet] - 10https://gerrit.wikimedia.org/r/300321 (https://phabricator.wikimedia.org/T141023) (owner: 10Dzahn) [18:20:37] (03PS1) 10Chad: Gerrit: Further tweaks to down/maintenance mode [puppet] - 10https://gerrit.wikimedia.org/r/300323 [18:20:53] (03CR) 10Dzahn: [C: 032] jsbench: chromium-browser on trusty, chromium on jessie [puppet] - 10https://gerrit.wikimedia.org/r/300321 (https://phabricator.wikimedia.org/T141023) (owner: 10Dzahn) [18:22:24] (03CR) 10Chad: [C: 04-2] "My fear with this approach is that things will fail silently or in unexpected ways. 
Rather we should just ensure this is always available " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299996 (https://phabricator.wikimedia.org/T140889) (owner: 10Dereckson) [18:22:25] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:24:16] (03CR) 10Elukey: Move the Analytics Refinery role to scap3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [18:30:46] So block conflicts seem to be having a SQL database query problem? [18:31:10] I mean, I end up getting one when I block someone exactly at the same time as someone else [18:31:24] Just got one [18:31:27] "Function: IndexPager::buildQueryInfo (LogPager)" [18:31:36] "Error: 2013 Lost connection to MySQL server during query (10.64.32.25)" [18:32:07] (03PS3) 10Elukey: Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) [18:32:08] Oh, right, there's a phab thing for this issue. [18:32:55] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480071 (10Paladox) Could someone do the images please? We need the normal image, a 1.5x image and 2x image please. By image I mean logo please. [18:34:21] (03CR) 10Chad: "https://puppet-compiler.wmflabs.org/3432/ shows no real changes except addition of 503 directive to default apache config. I'm pretty sure" [puppet] - 10https://gerrit.wikimedia.org/r/300323 (owner: 10Chad) [18:36:32] (03CR) 10Paladox: [C: 031] "Looks all good, and we will have a better looking maintenance page too." 
[puppet] - 10https://gerrit.wikimedia.org/r/300323 (owner: 10Chad) [18:41:02] ostriches: for 299996, in PS1, I offered a more conservative approach: use require_once and explicitly not require it when run from a maintenance script [18:41:18] ostriches: https://gerrit.wikimedia.org/r/#/c/299996/1/wmf-config/wikitech.php [18:42:37] Heh, that could be one-lined into defined( 'DO_MAINTENANCE' ) || include_once( ... ) [18:42:47] Which I guess works around the error, but still doesn't solve my problem. [18:42:53] If the file should be loaded, it should always be loaded. [18:42:58] Not just because the file DNE. [18:43:05] Or we're doing maintenance. [18:43:19] I'd /rather/ it fail hard and fast than unexpectedly and quiet. [18:43:39] (03PS1) 10Ori.livneh: Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) [18:44:52] (03CR) 10jenkins-bot: [V: 04-1] Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) (owner: 10Ori.livneh) [18:46:02] omg, did not align arrows [18:47:15] RECOVERY - configured eth on relforge1001 is OK: OK - interfaces up [18:47:34] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [18:47:56] RECOVERY - dhclient process on relforge1001 is OK: PROCS OK: 0 processes with command name dhclient [18:48:05] RECOVERY - DPKG on relforge1001 is OK: All packages OK [18:48:25] RECOVERY - Check size of conntrack table on relforge1001 is OK: OK: nf_conntrack is 0 % full [18:48:44] RECOVERY - Disk space on relforge1001 is OK: DISK OK [18:49:32] (03CR) 10Ottomata: [C: 031] Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [18:52:44] RECOVERY - puppet last run on relforge1001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0
failures [18:55:20] (03PS2) 10Ori.livneh: Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) [18:57:38] ostriches: what about create a Puppet class to provision an empty /etc/mediawiki/WikitechPrivateSettings.php file, add it to deployment::server (tin, mira), mediawiki::maintenance (terbium, wasat, mw1152) roles? [18:58:41] That still doesn't solve the problem. It should be on all MW nodes. [18:59:01] An empty file just means we (once again) fail quietly because we're misconfigured. [19:00:04] ostriches: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160721T1900). Please do the needful. [19:03:59] Dereckson: And if it shouldn't be on all MW nodes, then those nodes (maintenance, deploy masters) shouldn't be able to mess with it via maintenance. [19:04:07] (which also seems wrong if tin/mira cant) [19:04:25] (03PS2) 10Chad: Last wikis to wmf.11 (group2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300298 [19:04:45] (03CR) 10Jforrester: "Do we know what the current regular rates of these are?" [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) (owner: 10Ori.livneh) [19:05:46] (03CR) 10Ori.livneh: "@Jforrester, https://graphite.wikimedia.org/S/Bf" [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) (owner: 10Ori.livneh) [19:05:56] RECOVERY - NTP on relforge1001 is OK: NTP OK: Offset -0.01760518551 secs [19:07:18] (03CR) 10Chad: [C: 032] Last wikis to wmf.11 (group2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300298 (owner: 10Chad) [19:07:41] ori: Thanks. Does that mean we'll get pages several times a day at current rates? 
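The `DO_MAINTENANCE` one-liner floated above would look roughly like this in `wmf-config/wikitech.php` (an illustrative sketch only; as noted in the discussion, this works around the missing-file error rather than fixing the underlying misconfiguration, and ostriches would rather it fail hard and fast than quietly):

```php
// Hypothetical fragment for wmf-config/wikitech.php.
// Maintenance scripts define DO_MAINTENANCE, so the private settings
// are skipped there; everywhere else the file loads as before.
// Note: include_once fails quietly if the file is absent -- exactly
// the silent-failure mode objected to on-channel.
defined( 'DO_MAINTENANCE' ) || include_once( '/etc/mediawiki/WikitechPrivateSettings.php' );
```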
[19:07:50] (03Merged) 10jenkins-bot: Last wikis to wmf.11 (group2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300298 (owner: 10Chad) [19:08:48] From eyeballing that, 02:00, 04:30 (big), 09:30 (big), 13:00 (just), 15:00 in the last 24 hours. [19:09:31] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: last wikis to wmf.11 [19:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:09:39] James_F: Yeah. I'm glad you're looking -- this could use another pair of eyes. What do you think the thresholds should be? [19:10:09] (03PS1) 10Alex Monk: [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [19:10:22] Maybe start at 25/50? [19:10:31] ori: Well, I think the thresholds are if anything too high, but… eh. Maybe 20/40 instead of 15/25? [19:10:34] 'warn' is mostly meaningless, since the threshold for alerting on irc / paging is crit [19:10:35] Or what Chad said. [19:10:48] From what I see in logstash's mw error channels, that seems low enough to trigger for Bad Stuff, but high enough to not needlessly flap (which we probably would do at first) [19:10:58] * ori nods [19:11:01] sounds good, I'll update the patch [19:11:02] Can we get IRC pings for warn as well on this one? [19:11:10] (If that's hard, never mind.) [19:11:14] ori: I'm all for lowering that in time. [19:11:19] (03CR) 10jenkins-bot: [V: 04-1] [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [19:11:19] +1 [19:11:20] less errors -> happy chad [19:11:39] And happy users -> happy James. 
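[Editor's note] The warn/crit back-and-forth above reduces to a simple state mapping. A minimal sketch using the 25/50 values under discussion; the production check is a graphite-backed Icinga check that compares the percentage of datapoints above each threshold, so this is only illustrative:

```python
def classify(rate_per_minute, warn=25, crit=50):
    """Map a per-minute exceptions+fatals rate to a Nagios-style state.

    warn/crit defaults are the 25/50 values from the discussion above;
    the real check's semantics (percent of datapoints over a window)
    differ from this point-in-time comparison.
    """
    if rate_per_minute >= crit:
        return "CRITICAL"
    if rate_per_minute >= warn:
        return "WARNING"
    return "OK"
```

As noted in the discussion, only CRITICAL pages or notifies on IRC by default, which is why the warn value is "mostly meaningless" until warn-level notifications are wired up.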
[19:11:48] yeah, but I understand the concern -- if this is too noisy it'll train people to ignore it [19:12:12] (03PS3) 10Ori.livneh: Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) [19:12:25] I'm also interested in possibly moving alerts about the MW part of the stack to a different IRC channel to the ones about the metal. [19:12:32] We should also run down more of these "failed to connect to redis" ones. [19:12:32] (03CR) 10Dereckson: "@MarcoAurelio @Paladox We reversed the VE logic: all wikis now have it, instead those in visualeditor-nondefault.dblist." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [19:12:41] Either they don't need to log or they need to be more annoying. [19:12:42] (03PS4) 10Ori.livneh: Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) [19:12:47] Right now I mostly see them as spam in logstash [19:13:00] 'Cos this channel has notifications about puppet (which mere deployers can't do anything about) and about deployments (which they can). [19:13:35] (03CR) 10Dereckson: "(excepted)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [19:13:36] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2484831 (10Paladox) >>! In T140898#2484399, @Pavanaja wrote: >>>! In T140898#2482774, @Dzahn wrote: >> copying verbatim comment from @Glaisher on T13401... [19:13:37] I come back to this channel and there are often >1k messages over night since I log off around 22:00. 
[19:13:46] we can echo the alerts on additional channels, but I would like -operations to provide a synoptic view of site reliability [19:14:00] so I'd add channels rather than move it [19:14:01] Oh, sure. It's mostly a call for ostriches and the rest of RelEng. [19:14:18] Absolutely, don't want to reduce the value of this channel to others. [19:14:26] * ostriches already reads all the things [19:14:29] * ostriches also has no life [19:15:00] if one of you +1s i'll merge it [19:15:11] (03CR) 10Jforrester: [C: 031] Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) (owner: 10Ori.livneh) [19:15:27] (03PS5) 10Ori.livneh: Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) [19:15:30] (03CR) 10Chad: [C: 031] "+1s is almost sorta (nothing like) a +2" [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) (owner: 10Ori.livneh) [19:16:03] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint, 13Patch-For-Review: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2484834 (10RobH) [19:16:05] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, and 2 others: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2484832 (10RobH) 05Resolved>03Open #debt: this is not closed, as I have not finished the decommission process. Please don't resolve this task. [19:16:55] ori: How much do you know about sms paging from icinga? [19:17:13] not much, but ask anyway [19:17:34] Can we make a group for releng that *does* SMS paging for releng? [19:17:34] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2484836 (10RobH) [19:17:38] Eh, said releng twice. 
[19:17:46] yes [19:18:21] I know I get e-mails when PHD dies (or did), but I need something more eye-catching [19:18:23] like sms :p [19:18:45] see the git log for modules/nagios_common/files/contactgroups.cfg [19:18:57] Yeah I'm in there in a couple of groups. [19:19:03] (03CR) 10Greg Grossmeier: "Right now that's (over the last 24hours) (eye-balling here):" [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) (owner: 10Ori.livneh) [19:19:05] I'm just not sure how that ties to sending me a text [19:19:17] there is one special contact group called "sms" [19:19:23] if you are in there you get paged [19:19:31] but then you get all the ops pages currently [19:19:46] Yeah, that's not what I want. I want something like sms-releng [19:19:52] (or able to trigger sms for random groups) [19:19:58] Whichever is possible [19:20:12] (03CR) 10Ori.livneh: [C: 032] Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) (owner: 10Ori.livneh) [19:20:42] yea, we don't have that as a feature yet [19:20:44] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2484855 (10debt) Whoops, sorry! Please continue doing what needs to be done and thanks for removing the search tags. :) [19:20:49] it's a bit complicated [19:20:53] I thought we had service groups? [19:21:13] I only get (got?) Parsoid pages, not general Ops ones. [19:21:17] but apparently only one that is hooked up to SMS [19:21:34] ostriches: so you'll need an opsen to push the contacts.cfg changes to add the individual data for each sms person [19:21:40] ack, i got stuck in backlog [19:21:42] we have service groups for teams and stuff [19:21:44] sorry, outdated comment! [19:21:49] with email notification [19:21:59] Ah, but the groups are not for SMS? OK.
[19:22:03] but we need to add the SMS notification method [19:22:05] yes [19:22:10] mutante: uh, i thought we paged some folks not in ops already? [19:22:20] like services? (maybe im misrecalling) [19:22:35] oh, james asked shit im on a delay it seems. [19:22:43] * robh is just not with it today [19:24:20] https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/nagios_common/files/contactgroups.cfg shows only Opsen getting SMSes. [19:24:37] robh: do we? it's possible that we added it to an individual contact.. [19:25:10] James_F: Yeah, that's the sms group. Which I could technically add myself to, but I don't need alerts when cr1 goes flapping (for example). Nothing I can do about it. [19:25:18] yea, first we had no custom groups at all.. then we did that with email [19:25:23] ostriches: Indeed. [19:25:27] i thought we had made it more granular for individual pages [19:25:38] but i never got anything but all the pages so i could be easily mistaken. [19:25:50] how hard is it to add sms-groups? [19:25:58] seems like an obvious thing for services, no? [19:26:14] not easy enough [19:26:23] i remember we looked before [19:26:32] let's make a ticket, will also check for an old one [19:26:47] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [19:26:54] mutante: thanks, I'll subscribe :) [19:27:06] 06Operations, 10VisualEditor, 13Patch-For-Review, 07Performance: fix puppet run on osmium (remove or provide chromium-browser package on jessie) - https://phabricator.wikimedia.org/T141023#2484860 (10Jdforrester-WMF) Osmium is now fixed, so this can be closed? Thank you.
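[Editor's note] The kind of group being asked for would look roughly like this in Icinga's object configuration. This is a hypothetical sketch: the group name and members are invented, and the real definitions live in modules/nagios_common/files/contactgroups.cfg and contacts.cfg:

```cfg
define contactgroup {
        contactgroup_name       sms-releng              ; hypothetical group
        alias                   RelEng SMS paging
        members                 demon, jforrester       ; invented members
}
```

As mutante points out, the missing piece is an SMS notification method attached to such a group (and per-person contact data), not the group definition itself.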
[19:27:19] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2484861 (10RobH) [19:27:42] ori: If you want to create a "mediawiki" group for those alerts, please add me to it (even if it doesn't get SMSes). [19:27:55] !log demon@tin Synchronized wikiversions.json: because sync-wikiversions doesn't care about co-masters ugh (duration: 00m 29s) [19:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:28:56] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [19:29:11] robh: ostriches: just remembered something.. so we had this same question for ores and h.alfak [19:29:38] and what he did is get email from icinga and then forward it to a mail2sms gateway himself [19:29:53] so that results in paging without us having it implemented like that [19:30:20] ha! well, if they are a US carrier, we can put their contact email address as their sms email [19:30:23] that's a workaround as well [19:30:30] yes, that [19:30:35] actually, can do that with either but it will be a messy format [19:30:40] which may be non-ideal. [19:30:43] the notification type SMS in icinga is also just email [19:30:54] just a special type of email that turns it into an SMS [19:31:03] (03PS1) 10Yuvipanda: tools: Add check for high iowait [puppet] - 10https://gerrit.wikimedia.org/r/300334 (https://phabricator.wikimedia.org/T141017) [19:31:12] well, it's also a more terse format for short text reading [19:31:16] yes? [19:31:16] (03PS2) 10Yuvipanda: tools: Add check for high iowait [puppet] - 10https://gerrit.wikimedia.org/r/300334 (https://phabricator.wikimedia.org/T141017) [19:31:21] but otherwise same content overall [19:31:22] and depending on the provider it's just something like @txt.att.com etc [19:31:38] so it's still non-ideal due to formatting i would think.
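[Editor's note] The mail2sms workaround described above (Icinga mails a carrier's email-to-SMS gateway instead of a mailbox) can be sketched as follows. The sender address and gateway domain are invented; real carriers each use their own gateway:

```python
from email.message import EmailMessage

def sms_notification(number, host, state, output):
    """Build a terse notification addressed to an email-to-SMS gateway."""
    msg = EmailMessage()
    msg["From"] = "icinga@example.org"                   # invented sender
    msg["To"] = "%s@txt.example-carrier.com" % number    # invented gateway
    msg["Subject"] = "PROBLEM"
    # Keep the body short: gateways truncate around 160 characters,
    # which is the "messy format" concern raised above.
    msg.set_content(("%s %s: %s" % (host, state, output))[:160])
    return msg
```

The resulting message would then be handed to whatever MTA Icinga already uses for email notifications; "SMS" is just a terser email template aimed at the gateway address.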
[19:31:52] (03PS3) 10Yuvipanda: tools: Add check for high iowait [puppet] - 10https://gerrit.wikimedia.org/r/300334 (https://phabricator.wikimedia.org/T141017) [19:32:01] (03PS4) 10Yuvipanda: tools: Add check for high iowait [puppet] - 10https://gerrit.wikimedia.org/r/300334 (https://phabricator.wikimedia.org/T141017) [19:32:08] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add check for high iowait [puppet] - 10https://gerrit.wikimedia.org/r/300334 (https://phabricator.wikimedia.org/T141017) (owner: 10Yuvipanda) [19:34:18] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2484870 (10RobH) No worries, I figured removing the tags would clear it from your workboards/radar =] [19:39:04] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2484884 (10RobH) a:05RobH>03Cmjohnson [19:39:37] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2441721 (10RobH) Assigned to @cmjohnson for ssh removal/disk wipe before unracking. Once they are unracked (and added to decom tracking sheet), their mgmt dns entries can be pulled. [19:40:11] robh: that is true about formatting.. and i found the line where the format is set. we can do something there [19:40:32] like host-notify-by-sms-gateway-SERVICE [19:40:42] but on ticket is good [19:41:08] * robh is decommissioning all the things [19:41:24] 06Operations, 10Icinga: implement icinga paging for non-ops teams - https://phabricator.wikimedia.org/T141038#2484903 (10Dzahn) [19:41:29] robh: :) [19:41:35] see, we gave cmjohnson1 enough time to dig out from under a pile of new metal. now we're gonna bury him under old metal. 
[19:41:39] (03PS6) 10Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) [19:41:57] (03CR) 10jenkins-bot: [V: 04-1] Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [19:41:59] hahaha [19:43:33] let's just get that old metal off the racks so I can get rid of it all at one time [19:44:06] yeah i just pushed the one for the elastic1001-1016 for you to wipe and unrack (or remove ssds and unrack) [19:44:12] (03PS7) 10Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) [19:44:17] more things to pull, huzzah [19:46:02] (03CR) 10Jforrester: [C: 031] Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [19:49:39] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [19:53:12] (03CR) 10Dereckson: [C: 04-1] "Some images issues to fix" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [19:55:30] (03CR) 10Dereckson: "Namespaces should use space, not underscore" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [19:55:59] (03CR) 10Dereckson: "Namespaces should use underscores, not spaces." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [20:01:18] ostriches: I'm going to throw some 50 MWExceptions from eval.php on tin just to test the alert [20:01:24] k [20:01:36] James_F: Why do we have securepollglobal.dblist? 
It's basically all.dblist - (loginwiki, labswiki, labstestwiki, zerowiki) [20:02:04] ostriches: Probably it's used for the maintenance script and it was easier for you/Reedy/Roan at the time. ;-) [20:02:40] Ah, could be [20:02:54] ostriches: There are several low-value dblists it'd be nice to kill. [20:03:30] ostriches: And vice versa, it might be sensible to define "all - loginwiki - votewiki" as a list, given how often we set that in InitSettings. [20:04:01] Yeah, I'm not opposed to keeping lists around if they can use expressions [20:04:21] It's mainly: adding to all.dblist should result in expected defaults, not "you also gotta add it to foo" [20:04:27] No all - nonglobal ? [20:04:44] Yup. [20:04:54] Hence why I moved to ve-nondefault. [20:04:55] Etc. [20:05:00] chad@notsexy /a/ops/mediawiki-config/dblists (master)$ diff all.dblist securepollglobal.dblist [20:05:00] 438,439d437 [20:05:00] < labswiki [20:05:00] < labtestwiki [20:05:00] 464d461 [20:05:01] < loginwiki [20:05:01] 879d875 [20:05:02] < zerowiki [20:06:32] (03PS24) 10Gehel: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (https://phabricator.wikimedia.org/T138501) [20:07:28] Yeah, so securepollglobal.dblist is used in SecurePoll for page creation or something [20:07:35] It needs a list of db names. [20:07:44] But that list seems wrong as-is. [20:09:06] (03CR) 10Gehel: [C: 032] Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (https://phabricator.wikimedia.org/T138501) (owner: 10Gehel) [20:09:28] 06Operations, 10VisualEditor, 13Patch-For-Review, 07Performance: fix puppet run on osmium (remove or provide chromium-browser package on jessie) - https://phabricator.wikimedia.org/T141023#2484957 (10Dzahn) I wanted to check one last thing. i saw "chromium-browser" was used in a script.
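[Editor's note] The "expressions over hand-maintained copies" idea above amounts to a set difference over dblists: derive securepollglobal.dblist from all.dblist instead of keeping a second list that drifts. A sketch; the function names and file handling are hypothetical, not the actual mediawiki-config tooling:

```python
def read_dblist(path):
    """One db name per line, as in mediawiki-config's dblists/ files."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def subtract(base, excluded):
    """all.dblist minus a few wikis, preserving the base order."""
    excluded = set(excluded)
    return [db for db in base if db not in excluded]

# e.g., matching the diff pasted above:
#   securepollglobal = subtract(
#       read_dblist("all.dblist"),
#       ["labswiki", "labtestwiki", "loginwiki", "zerowiki"])
```

With this approach, adding a wiki to all.dblist gives the expected defaults automatically, which is the property asked for above.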
getting on this now [20:10:36] (03PS2) 10Alex Monk: [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [20:10:47] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [20:10:52] \o/ [20:11:00] tgr: ^ [20:11:10] (just a test) [20:11:15] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [20:12:14] (03CR) 10jenkins-bot: [V: 04-1] [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [20:14:10] (03PS8) 10Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) [20:16:56] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [20:22:07] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [20:22:35] RECOVERY - salt-minion processes on relforge1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:23:07] (03CR) 10Jforrester: Enable ShortUrl on Urdu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298344 (https://phabricator.wikimedia.org/T138507) (owner: 10Jforrester) [20:29:57] PROBLEM - MegaRAID on db1011 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [20:31:53] (03PS9) 10Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) [20:35:04] (03PS3) 10Alex Monk: [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [20:38:18] 06Operations, 10Monitoring, 
06Release-Engineering-Team, 13Patch-For-Review: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2485115 (10ori) [20:40:54] (03PS10) 10Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) [20:40:56] (03PS1) 10Andrew Bogott: Upgrade labtest to openstack Liberty [puppet] - 10https://gerrit.wikimedia.org/r/300347 [20:48:35] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:49:27] 06Operations, 10VisualEditor, 13Patch-For-Review, 07Performance: fix puppet run on osmium (remove or provide chromium-browser package on jessie) - https://phabricator.wikimedia.org/T141023#2485183 (10Dzahn) There is a custom upstart script to start chromium-browser that is puppetized. But that needs to be... [20:53:28] (03PS4) 10Alex Monk: [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [20:53:58] (03PS2) 10Andrew Bogott: Upgrade labtest to openstack Liberty [puppet] - 10https://gerrit.wikimedia.org/r/300347 [20:54:49] 06Operations, 10ops-eqiad: db1011 disk failure (degraded RAID) - https://phabricator.wikimedia.org/T141046#2485211 (10jcrespo) [20:55:39] ACKNOWLEDGEMENT - MegaRAID on db1011 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo https://phabricator.wikimedia.org/T141046 [20:56:01] (03CR) 10jenkins-bot: [V: 04-1] [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [20:58:11] 06Operations, 10ops-eqiad, 10DBA: dbstore1002 disk errors - https://phabricator.wikimedia.org/T140337#2485251 (10jcrespo) ``` megacli -PDRbld -ShowProg -PhysDrv'[32:6]' -a0 Rebuild Progress on Device at Enclosure 32, Slot 6 Completed 98% in 908 Minutes. 
``` [20:59:32] 06Operations, 10ops-eqiad, 10DBA, 06DC-Ops: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#2485255 (10jcrespo) 05stalled>03Resolved a:03Cmjohnson [21:00:55] (03PS1) 10Gehel: Actually create initial import script for OSM data [puppet] - 10https://gerrit.wikimedia.org/r/300410 (https://phabricator.wikimedia.org/T138501) [21:01:28] 06Operations, 10DBA, 13Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2485260 (10jcrespo) It seems dbproxy1002 was "accidentally" upgraded to jessie today: T140983 [21:03:41] (03PS5) 10Alex Monk: [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [21:05:05] 06Operations, 10ops-eqiad, 10DBA: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2485265 (10jcrespo) We need to revert https://gerrit.wikimedia.org/r/300254 once we check everything is working and have a window where it is not disruptive. [21:05:45] (03CR) 10MaxSem: [C: 031] Actually create initial import script for OSM data [puppet] - 10https://gerrit.wikimedia.org/r/300410 (https://phabricator.wikimedia.org/T138501) (owner: 10Gehel) [21:08:05] 06Operations, 10Monitoring, 06Release-Engineering-Team, 13Patch-For-Review: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2485278 (10Tgr) Thinking about this more, not sure if login/signup metrics are worth the effort. One of the strengths of Wikimedia is the stro... 
[21:08:12] 06Operations, 10DBA, 13Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2485279 (10jcrespo) [21:08:17] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2485280 (10jcrespo) [21:08:52] (03PS2) 10Reedy: Apply WMF specific SiteMatrix config in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300081 (https://phabricator.wikimedia.org/T132125) [21:08:56] (03PS3) 10Reedy: Apply WMF specific SiteMatrix config in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300081 (https://phabricator.wikimedia.org/T132125) [21:09:58] (03CR) 10Reedy: [C: 032] "Removed dependency so this can go out first (co-exists with config already in SiteMatrix no issue)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300081 (https://phabricator.wikimedia.org/T132125) (owner: 10Reedy) [21:10:41] (03CR) 10Gehel: [C: 032] Actually create initial import script for OSM data [puppet] - 10https://gerrit.wikimedia.org/r/300410 (https://phabricator.wikimedia.org/T138501) (owner: 10Gehel) [21:10:57] (03Merged) 10jenkins-bot: Apply WMF specific SiteMatrix config in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300081 (https://phabricator.wikimedia.org/T132125) (owner: 10Reedy) [21:11:54] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Moved WMF specific SiteMatrix data to CommonSettings (duration: 00m 26s) [21:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:18:41] (03PS1) 10Gehel: Maps - initial import script [puppet] - 10https://gerrit.wikimedia.org/r/300423 (https://phabricator.wikimedia.org/T138501) [21:23:51] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:26:02] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:26:12] ^ maps2001 is me, patch 
coming up... [21:26:30] (03PS1) 10Dzahn: jsbench: add systemd compat for jsbench-browser [puppet] - 10https://gerrit.wikimedia.org/r/300425 (https://phabricator.wikimedia.org/T141023) [21:26:38] (03CR) 10Gehel: [C: 032] Maps - initial import script [puppet] - 10https://gerrit.wikimedia.org/r/300423 (https://phabricator.wikimedia.org/T138501) (owner: 10Gehel) [21:27:22] (03PS2) 10Dzahn: jsbench: add systemd compat for jsbench-browser [puppet] - 10https://gerrit.wikimedia.org/r/300425 (https://phabricator.wikimedia.org/T141023) [21:28:24] 06Operations, 10Monitoring, 06Release-Engineering-Team, 13Patch-For-Review: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2485421 (10Jdforrester-WMF) >>! In T140942#2485278, @Tgr wrote: > Thinking about this more, not sure if login/signup metrics are worth the eff... [21:30:00] RECOVERY - puppet last run on maps2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:31:03] (03PS3) 10Dzahn: admin: add shell account for Jasmeet Samra [puppet] - 10https://gerrit.wikimedia.org/r/300196 (https://phabricator.wikimedia.org/T140445) [21:31:26] (03CR) 10Dzahn: [C: 032] admin: add shell account for Jasmeet Samra [puppet] - 10https://gerrit.wikimedia.org/r/300196 (https://phabricator.wikimedia.org/T140445) (owner: 10Dzahn) [21:35:46] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 3 others: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2485435 (10greg) [21:43:55] (03PS3) 10Dzahn: remove all aluminum/aluminium remnants [dns] - 10https://gerrit.wikimedia.org/r/300213 (https://phabricator.wikimedia.org/T140676) [21:48:19] (03PS3) 10Andrew Bogott: Upgrade labtest to openstack Liberty [puppet] - 10https://gerrit.wikimedia.org/r/300347 [21:49:35] (03CR) 10Dzahn: [C: 032] remove all aluminum/aluminium remnants [dns] - 10https://gerrit.wikimedia.org/r/300213 
(https://phabricator.wikimedia.org/T140676) (owner: 10Dzahn) [21:50:38] (03CR) 10Andrew Bogott: [C: 032] Upgrade labtest to openstack Liberty [puppet] - 10https://gerrit.wikimedia.org/r/300347 (owner: 10Andrew Bogott) [21:52:34] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:52:59] (03PS2) 10BBlack: puppet @var warning on fastopen_pending_max [puppet] - 10https://gerrit.wikimedia.org/r/300302 [21:53:08] (03CR) 10BBlack: [C: 032 V: 032] puppet @var warning on fastopen_pending_max [puppet] - 10https://gerrit.wikimedia.org/r/300302 (owner: 10BBlack) [21:58:59] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2485532 (10GWicke) After investigating this for a while I am now fairly certain that that the master process exit was indeed caused by a DNS resoluti... [22:01:36] !log stat1002 - puppetized git pull from "refinery_source" fails [22:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:44] RECOVERY - MegaRAID on dbstore1002 is OK: OK: optimal, 1 logical, 2 physical [22:19:46] "Please do not submit patches through LinkedIn, or at the very least submit it as an unified diff" hahah [22:21:04] ? [22:21:15] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.113:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.48.113, port=9200): Read timed out. 
(read timeout=4) [22:21:48] greg-g it is on someone's LinkedIn profile [22:21:59] who works for wikimedia foundation [22:22:08] !log Restarted kibana4 on logstash1001 for "node[18588]: segfault at 2fcb25f00009 ip 0000000000ad9846 sp 00007ffe526bbb40 error 4 in node[400000+1383000]" [22:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:25:04] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.32.137, port=9200): Read timed out. (read timeout=4) [22:28:20] greg-g: hashar had to put that on his profile. i guess they tried to send him patches that way [22:28:59] LOL [22:38:48] 06Operations, 10Analytics-Cluster: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2485674 (10Dzahn) [22:39:14] 06Operations, 10Analytics-Cluster: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2485686 (10Dzahn) [22:40:30] ACKNOWLEDGEMENT - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn https://phabricator.wikimedia.org/T141062 [22:42:06] (03PS1) 10ArielGlenn: fix up base wiki handling for onallwikis [dumps] - 10https://gerrit.wikimedia.org/r/300437 [22:42:26] CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: HTTPConnectionPool(host='10.64.32.137', port=9200): Read timed out. (read timeout=4) [22:42:36] are these decom'ed elasticsearch servers? [22:42:51] because i see in backlog things like "elastic1001-1016 for you to wipe and unrack" [22:43:07] mutante: if it's 1001-1016, then yes.
checking if it is [22:43:30] http://10.64.32.137:9200/ and http://10.64.48.113:9200/_ [22:43:52] 1002, 1003 [22:43:57] mutante: thats logstash1002 actually [22:44:03] something is up with that server, looking [22:44:08] oh, right [22:44:09] thanks [22:44:16] Er, bd808 ^ [22:46:20] !log restart elasticsearch on logstash1002 [22:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:48:48] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 26, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards [22:50:05] looks like it hit a java OOM, then had some issues. logstash1001-3 have their heap set to 2G, might be worthwhile to increase it. Will have to check with bd808 on that though [22:50:26] elastic heap at 2g? [22:50:28] (03PS3) 10Dzahn: Add new user 'hjiang' for Helen Jiang [puppet] - 10https://gerrit.wikimedia.org/r/300003 (https://phabricator.wikimedia.org/T140659) (owner: 10Elukey) [22:50:30] how does it liveeeeee? [22:50:35] ebernhardson: up to you :) you touched it last [22:50:38] ostriches: 1001-3 aren't data nodes [22:50:50] and also ostriches is breaking stuff for fun I think ;) [22:50:51] Speaking of heap, I should raise gerrit's on lead. [22:50:51] they are basically just routers [22:51:13] bd808: Yeah, making a visualization. A lot to ask for a log visualization platform :p [22:51:19] the new version of es we deployed might be a bit more memory hungry than the 1.7 [22:51:30] *nod* [22:52:09] logstash1001 is showing 4G of free ram [22:52:49] maybe bump from 2G to 4G? 
[22:53:21] those nodes should really only need ES ram to do aggregations but maybe we are doing more now [22:54:15] 07Blocked-on-Operations, 06Operations, 10Parsoid, 10Salt: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#2485714 (10ggellerman) [22:54:27] yea i created a ticket to bump from 2G to 4G. The increased usage of aggregations makes sense for pushing it up [22:57:04] (03PS1) 10EBernhardson: Increase elasticsearch heap on logstash routing instances [puppet] - 10https://gerrit.wikimedia.org/r/300440 (https://phabricator.wikimedia.org/T141063) [23:00:05] RoanKattouw, ostriches, MaxSem, awight, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160721T2300). [23:00:05] Addshore, Jdlrobson, Pchelolo, James_F, MaxSem, and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:14] (03PS4) 10Dzahn: Add new user 'hjiang' for Helen Jiang [puppet] - 10https://gerrit.wikimedia.org/r/300003 (https://phabricator.wikimedia.org/T140659) (owner: 10Elukey) [23:00:18] I'm here [23:00:40] * MaxSem looks around [23:00:51] MaxSem: that looks like a volunteer! [23:01:02] here [23:01:02] Heya. [23:01:03] aaaaaaaaaaaaaaá [23:02:33] (03PS1) 10Ppchelko: Change-Prop: Definition rerender bug - don't react to revision change [puppet] - 10https://gerrit.wikimedia.org/r/300442 [23:02:59] how is it 4pm already? 
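[Editor's note] The 2G-to-4G bump being proposed comes down to a one-line heap setting on the logstash routing nodes. A hypothetical fragment; the actual change is the puppet patch referenced above (Gerrit 300440), and in Elasticsearch of that era the heap was commonly set via ES_HEAP_SIZE:

```shell
# Hypothetical /etc/default/elasticsearch fragment on logstash1001-3;
# in production this value is managed through puppet, not hand-edited.
ES_HEAP_SIZE=4g
```

Setting -Xms and -Xmx to the same value (which ES_HEAP_SIZE does) avoids heap-resize pauses, and 4g stays well within the free RAM observed on logstash1001 above.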
[23:03:02] *waves* [23:04:39] okay, sent all the extension patches to zuul [23:04:42] (03CR) 10Dzahn: [C: 032] Add new user 'hjiang' for Helen Jiang [puppet] - 10https://gerrit.wikimedia.org/r/300003 (https://phabricator.wikimedia.org/T140659) (owner: 10Elukey) [23:05:40] (03CR) 10MaxSem: [C: 032] RevisionSlider enables: dewiki, hewiki, arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298933 (https://phabricator.wikimedia.org/T140232) (owner: 10Addshore) [23:06:14] (03PS4) 10MaxSem: RevisionSlider enables: dewiki, hewiki, arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298933 (https://phabricator.wikimedia.org/T140232) (owner: 10Addshore) [23:06:22] (03CR) 10MaxSem: [C: 032] RevisionSlider enables: dewiki, hewiki, arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298933 (https://phabricator.wikimedia.org/T140232) (owner: 10Addshore) [23:06:35] did I already mention how I hate this new setting? [23:06:45] which? [23:06:55] must rebase before merging [23:07:06] (03Merged) 10jenkins-bot: RevisionSlider enables: dewiki, hewiki, arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298933 (https://phabricator.wikimedia.org/T140232) (owner: 10Addshore) [23:07:56] addshore, pulled on mw1099 [23:08:00] checking [23:08:40] looks good MaxSem [23:09:25] ostriches: heapLimit = 20g [23:09:32] is that it (and a lot ?) [23:09:49] Gah, laptop is picking an unfortunate time to reboot. [23:09:50] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/298933/ (duration: 00m 29s) [23:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:00] mutante: Yeah that's it. And nah it's not a lot :P [23:10:09] addshore, deployed [23:10:15] *checks* [23:10:34] looks good! [23:10:44] \m/ [23:11:48] thanks MaxSem ! [23:12:06] * aude waves [23:12:10] Hi. 
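Most of the traffic above is the Gerrit IRC bot; the stray `03`/`10` digits are mIRC color-code residue left behind when the control characters were stripped. A small sketch of how one of those lines could be parsed back into structured fields (the regex and field names are my own, not any real tool's):

```python
import re

# Matches bot lines like:
#   (03CR) 10MaxSem: [C: 032] Subject here [repo] - 10https://gerrit.wikimedia.org/r/298933
# The two digits after "(" and before the nick are leftover mIRC color codes.
BOT_LINE = re.compile(
    r"\(\d{2}(?P<event>PS\d+|CR|Merged)\)\s+\d{2}(?P<author>\S+):\s+"
    r"(?:\[C:\s*\d+(?P<score>\d)\]\s+)?"   # optional vote; "032" renders a +2
    r"(?P<subject>.*?)\s+\[(?P<repo>[\w-]+)\]\s+-\s+\d*(?P<url>https://\S+)"
)

def parse_bot_line(line):
    """Return a dict of event/author/score/subject/repo/url, or None."""
    m = BOT_LINE.search(line)
    return m.groupdict() if m else None

line = ("(03CR) 10MaxSem: [C: 032] RevisionSlider enables: dewiki, hewiki, arwiki "
        "[mediawiki-config] - 10https://gerrit.wikimedia.org/r/298933")
print(parse_bot_line(line))
```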
[23:12:14] * addshore wave toward aude [23:12:21] (03PS4) 10MaxSem: Wikidata description config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299615 (https://phabricator.wikimedia.org/T140600) (owner: 10Jdlrobson) [23:12:22] * aude goes to enable revisionslider on arwiki and dewiki [23:12:29] (03CR) 10MaxSem: [C: 032] Wikidata description config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299615 (https://phabricator.wikimedia.org/T140600) (owner: 10Jdlrobson) [23:12:51] aude, woo! :) [23:12:52] (03PS1) 10Andrew Bogott: Catch liberty designate.conf up to the state of the art. [puppet] - 10https://gerrit.wikimedia.org/r/300444 [23:13:10] (03Merged) 10jenkins-bot: Wikidata description config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299615 (https://phabricator.wikimedia.org/T140600) (owner: 10Jdlrobson) [23:13:37] ebernhardson: thanks for cherry-picking Fix Searcher::$searchContext visibility to wmf11 :) [23:13:43] Dereckson: np [23:14:03] jdlrobson, pulled on mw1099 [23:14:09] looking [23:15:04] (03CR) 10Andrew Bogott: [C: 032] Catch liberty designate.conf up to the state of the art. [puppet] - 10https://gerrit.wikimedia.org/r/300444 (owner: 10Andrew Bogott) [23:16:41] 06Operations, 10Ops-Access-Requests, 06Editing-Analysis, 13Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2485842 (10Neil_P._Quinn_WMF) @Dzahn, thank you! One question: as far as I can tell, the patch creates a new shell account for Helen... 
[23:19:06] !log restarting uwsgi and celery for ores in scb 1001 [23:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:19:18] looks good MaxSem [23:20:29] !log maxsem@tin Synchronized dblists/wikidatadescriptions.dblist: https://gerrit.wikimedia.org/r/#/c/299615/ (duration: 00m 24s) [23:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:09] !log restarting uwsgi and celery for ores in scb1002 [23:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:58] (03PS1) 10Dzahn: gerrit: up heap size limit from 20GB to 28GB [puppet] - 10https://gerrit.wikimedia.org/r/300446 (https://phabricator.wikimedia.org/T141064) [23:22:23] !log maxsem@tin Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/299615/ (duration: 00m 29s) [23:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:34] jdlrobson, ^ [23:22:42] MaxSem: checking once more [23:23:26] sweet. That ones done [23:24:28] (03PS2) 10MaxSem: Lazy load images+references on Russian Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299619 (https://phabricator.wikimedia.org/T140197) (owner: 10Jdlrobson) [23:24:53] (03CR) 10MaxSem: [C: 032] Lazy load images+references on Russian Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299619 (https://phabricator.wikimedia.org/T140197) (owner: 10Jdlrobson) [23:25:29] (03Merged) 10jenkins-bot: Lazy load images+references on Russian Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299619 (https://phabricator.wikimedia.org/T140197) (owner: 10Jdlrobson) [23:26:03] jdlrobson, pulled on mw1099 [23:26:09] MaxSem: checking [23:26:30] MaxSem: and verified! 
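The patch above (https://gerrit.wikimedia.org/r/300446) raises Gerrit's JVM heap cap, matching the `heapLimit = 20g` value mutante found earlier. In Gerrit that setting lives in `gerrit.config` under the `container` section, which controls the daemon's `-Xmx`. A sketch only; the surrounding layout is illustrative, not lead's actual config:

```
[container]
    # Raised from 20g to 28g; caps the Gerrit JVM heap (-Xmx).
    heapLimit = 28g
```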
[23:27:18] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/299619/ (duration: 00m 24s) [23:27:20] jdlrobson, ^ [23:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:27:50] aawesome! thanks Max :) [23:29:36] Pchelolo and ebernhardson, pulled on mw1099 [23:29:43] MaxSem: checking [23:30:18] MaxSem: mine isn't really testable, it only effects job queue [23:30:32] cheater! [23:31:43] MaxSem: tested all I could, looks ok. [23:32:44] !log maxsem@tin Synchronized php-1.28.0-wmf.11/extensions/EventBus/: https://gerrit.wikimedia.org/r/#q,300332,n,z (duration: 00m 26s) [23:32:46] Pchelolo, ^ [23:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:33:00] thank you MaxSem, I'll monitor the logs [23:33:39] MaxSem: Laptop is now "calculating" how long it'll be offline. :-( [23:34:25] !log maxsem@tin Synchronized php-1.28.0-wmf.11/extensions/CirrusSearch/: https://gerrit.wikimedia.org/r/#q,300430,n,z https://gerrit.wikimedia.org/r/#q,300436,n,z (duration: 00m 32s) [23:34:27] ebernhardson, ^ [23:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:54] MaxSem: thanks, i'll keep an eye on the logs [23:35:08] James_F, drop by [23:35:35] MaxSem: all look great, thank you [23:35:56] (03PS2) 10MaxSem: Enable ShortUrl on Urdu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298344 (https://phabricator.wikimedia.org/T138507) (owner: 10Jforrester) [23:36:11] (03CR) 10MaxSem: [C: 032] Enable ShortUrl on Urdu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298344 (https://phabricator.wikimedia.org/T138507) (owner: 10Jforrester) [23:36:42] (03PS6) 10Dzahn: Gerrit: Greatly simplify directory management on host [puppet] - 10https://gerrit.wikimedia.org/r/300048 (owner: 10Chad) [23:36:46] (03Merged) 10jenkins-bot: Enable ShortUrl on Urdu Wikipedia [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/298344 (https://phabricator.wikimedia.org/T138507) (owner: 10Jforrester) [23:37:35] !log Restarted statsv on hafnium (cc Krinkle). 'gaierror: [Errno -3] Temporary failure in name resolution' [23:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:39:42] !log created ShortUrl tables on urwiki [23:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:24] !log on tin: ran mwscript extensions/ShortUrl/populateShortUrlTable.php --wiki=urwiki [23:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:01] (03CR) 10BryanDavis: [C: 031] Increase elasticsearch heap on logstash routing instances [puppet] - 10https://gerrit.wikimedia.org/r/300440 (https://phabricator.wikimedia.org/T141063) (owner: 10EBernhardson) [23:44:50] 06Operations, 10Ops-Access-Requests, 06Editing-Analysis, 13Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2485889 (10Dzahn) No, you are right about that. Just that the creation of the user and adding it to groups has to be in separate patc... 
[23:46:06] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#q,298344,n,z (duration: 00m 24s) [23:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:46:37] (03PS1) 10Ppchelko: Change-Prop: Revert the revert - ignore bots on ORES [puppet] - 10https://gerrit.wikimedia.org/r/300450 [23:47:38] jouncebot: status [23:49:11] (03CR) 10Dzahn: [C: 032] Gerrit: Greatly simplify directory management on host [puppet] - 10https://gerrit.wikimedia.org/r/300048 (owner: 10Chad) [23:49:19] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 24s) [23:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:50:16] (03PS2) 10MaxSem: Labs: remove wgDisableAuthManager - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300183 [23:50:24] (03CR) 10MaxSem: [C: 032] Labs: remove wgDisableAuthManager - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300183 (owner: 10MaxSem) [23:50:34] (03PS2) 10MaxSem: Labs: remove wmgUseOATHAuth - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300184 [23:50:42] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseOATHAuth - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300184 (owner: 10MaxSem) [23:51:16] (03PS2) 10MaxSem: Labs: remove wmgCirrusSearchUseCompletionSuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300185 [23:51:24] (03CR) 10MaxSem: [C: 032] Labs: remove wmgCirrusSearchUseCompletionSuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300185 (owner: 10MaxSem) [23:51:26] (03Merged) 10jenkins-bot: Labs: remove wgDisableAuthManager - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300183 (owner: 10MaxSem) [23:51:31] (03Merged) 10jenkins-bot: Labs: remove wmgUseOATHAuth - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300184 (owner: 10MaxSem) [23:51:41] 
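Each `!log ... Synchronized ... (duration: 00m 24s)` line above is scap reporting a completed file sync, and the duration suffix is machine-parseable. A quick sketch for pulling it out (the pattern is my own, not part of scap):

```python
import re

# Matches the "(duration: 00m 24s)" suffix scap appends to sync log lines.
DURATION = re.compile(r"\(duration: (\d+)m (\d+)s\)")

def sync_duration_seconds(log_line):
    """Return the scap sync duration in seconds, or None if absent."""
    m = DURATION.search(log_line)
    if not m:
        return None
    minutes, seconds = map(int, m.groups())
    return minutes * 60 + seconds

line = ("!log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: "
        "https://gerrit.wikimedia.org/r/#q,298344,n,z (duration: 00m 24s)")
print(sync_duration_seconds(line))  # -> 24
```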
(03PS2) 10MaxSem: Labs: remove wmgUseUrlShortener - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300186 [23:51:51] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseUrlShortener - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300186 (owner: 10MaxSem) [23:52:30] (03PS2) 10MaxSem: Labs: remove wmgLogAuthmanagerMetrics - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300187 [23:52:37] (03CR) 10MaxSem: [C: 032] Labs: remove wmgLogAuthmanagerMetrics - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300187 (owner: 10MaxSem) [23:52:43] (03Merged) 10jenkins-bot: Labs: remove wmgCirrusSearchUseCompletionSuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300185 (owner: 10MaxSem) [23:52:45] (03PS2) 10MaxSem: Labs: remove wmgUseBounceHandler - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300188 [23:52:50] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseBounceHandler - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300188 (owner: 10MaxSem) [23:52:58] (03PS2) 10MaxSem: Labs: remove wmgUseApiFeatureUsage - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300189 [23:53:05] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseApiFeatureUsage - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300189 (owner: 10MaxSem) [23:53:14] (03Merged) 10jenkins-bot: Labs: remove wmgUseUrlShortener - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300186 (owner: 10MaxSem) [23:53:16] (03PS2) 10MaxSem: Labs: remove wgUploadThumbnailRenderMethod - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300190 [23:53:24] (03CR) 10MaxSem: [C: 032] Labs: remove wgUploadThumbnailRenderMethod - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300190 (owner: 10MaxSem) [23:53:26] swat .. 
swat [23:53:35] (03PS2) 10MaxSem: Labs: remove wgUploadThumbnailRenderMap - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300191 [23:53:44] (03CR) 10MaxSem: [C: 032] Labs: remove wgUploadThumbnailRenderMap - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300191 (owner: 10MaxSem) [23:53:46] !log deploying 2d9817b to ores in scb nodes [23:53:49] How many patches? :P [23:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:01] only ten.. I know, lame [23:54:04] (03Merged) 10jenkins-bot: Labs: remove wmgLogAuthmanagerMetrics - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300187 (owner: 10MaxSem) [23:54:10] (03Merged) 10jenkins-bot: Labs: remove wmgUseBounceHandler - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300188 (owner: 10MaxSem) [23:54:12] (03PS2) 10MaxSem: Labs: remove wmgUseEventBus - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300192 [23:54:18] (03Merged) 10jenkins-bot: Labs: remove wmgUseApiFeatureUsage - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300189 (owner: 10MaxSem) [23:54:20] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseEventBus - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300192 (owner: 10MaxSem) [23:54:30] a full cleanup would be like 50 [23:54:41] (03PS2) 10MaxSem: Labs: remove wmgUseContentTranslation - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300193 [23:54:47] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseContentTranslation - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300193 (owner: 10MaxSem) [23:54:57] (03PS2) 10MaxSem: Labs: remove wmgUseEventLogging - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300194 [23:55:05] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseEventLogging - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300194 (owner: 10MaxSem) [23:55:15] 
(03PS2) 10MaxSem: Labs: remove wmgUseCampaigns - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300195 [23:55:18] (03PS1) 10Andrew Bogott: Rename oslo.config to oslo_config [puppet] - 10https://gerrit.wikimedia.org/r/300453 [23:55:19] PROBLEM - puppet last run on ytterbium is CRITICAL: CRITICAL: puppet fail [23:55:25] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseCampaigns - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300195 (owner: 10MaxSem) [23:55:30] (03Merged) 10jenkins-bot: Labs: remove wgUploadThumbnailRenderMethod - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300190 (owner: 10MaxSem) [23:55:37] (03Merged) 10jenkins-bot: Labs: remove wgUploadThumbnailRenderMap - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300191 (owner: 10MaxSem) [23:55:42] Reedy: All the patches. [23:55:43] (03Merged) 10jenkins-bot: Labs: remove wmgUseEventBus - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300192 (owner: 10MaxSem) [23:56:16] MaxSem: I bet you know the answer to this: https://phabricator.wikimedia.org/T139552#2484944 [23:56:49] (03Merged) 10jenkins-bot: Labs: remove wmgUseContentTranslation - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300193 (owner: 10MaxSem) [23:56:54] (03Merged) 10jenkins-bot: Labs: remove wmgUseEventLogging - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300194 (owner: 10MaxSem) [23:57:59] (03Merged) 10jenkins-bot: Labs: remove wmgUseCampaigns - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300195 (owner: 10MaxSem) [23:58:09] PROBLEM - puppet last run on lead is CRITICAL: CRITICAL: puppet fail [23:58:21] waiting for integration.wikimedia.org ... [23:58:43] kaldari, yes [23:58:50] i guess it's busy with Max patche s:) [23:58:52] (03PS2) 10Andrew Bogott: Rename oslo.config to oslo_config [puppet] - 10https://gerrit.wikimedia.org/r/300453