[00:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160721T0000).
[00:03:57] (PS1) Dzahn: admin: add shell account for Jasmeet Samra [puppet] - https://gerrit.wikimedia.org/r/300196 (https://phabricator.wikimedia.org/T140445)
[00:05:51] (PS2) BryanDavis: logstash: Parse nginx access logs for wdqs [puppet] - https://gerrit.wikimedia.org/r/299825
[00:11:57] (CR) BryanDavis: "I have cherry-picked the patch to deployment-puppetmaster (and fixed a syntax error from PS1)." [puppet] - https://gerrit.wikimedia.org/r/299825 (owner: BryanDavis)
[00:48:06] (CR) Jforrester: [C: -1] Initialize configuration for tcy.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[00:49:15] (CR) Dereckson: "Could you follow 9483358b3f80d85c2e5be1515a265a5b512f132f for commit message format?" [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[00:51:16] (CR) Dereckson: [C: -1] Initialize configuration for tcy.wikipedia (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[00:52:28] (CR) Dereckson: Initialize configuration for tcy.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[01:06:27] (PS3) Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898)
[01:09:57] (CR) Paladox: Initial configuration for tcy.wikipedia (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[01:10:03] (PS4) Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898)
[01:10:10] (CR) Dereckson: Initial configuration for tcy.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[01:11:19] (CR) Paladox: Initial configuration for tcy.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[01:12:38] (CR) Dereckson: Initial configuration for tcy.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[01:12:48] (CR) Dereckson: Initial configuration for tcy.wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[01:13:48] (PS5) Paladox: Initial
configuration for tcy.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898)
[01:17:35] Operations, ops-codfw, netops: audit network ports in a4-codfw - https://phabricator.wikimedia.org/T140935#2481487 (faidon) @RobH, try `show lldp neighbors` (with or without `| match ge-4` at the end).
[01:26:24] PROBLEM - MD RAID on mw1259 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:26:25] PROBLEM - SSH on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:28:14] RECOVERY - MD RAID on mw1259 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[01:28:16] RECOVERY - SSH on mw1259 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0)
[01:35:31] (PS3) Chad: WIP: Gerrit: Greatly simplify directory management on host [puppet] - https://gerrit.wikimedia.org/r/300048
[01:40:14] (PS4) Chad: Gerrit: Greatly simplify directory management on host [puppet] - https://gerrit.wikimedia.org/r/300048
[01:40:38] (CR) Chad: [C: +1] "Yay https://puppet-compiler.wmflabs.org/3420/" [puppet] - https://gerrit.wikimedia.org/r/300048 (owner: Chad)
[01:41:22] (CR) jenkins-bot: [V: -1] Gerrit: Greatly simplify directory management on host [puppet] - https://gerrit.wikimedia.org/r/300048 (owner: Chad)
[02:23:45] (CR) Krinkle: [C: -1] "404 Not Found /static/images/project-logos/tcywiki.png." [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[02:25:25] (CR) Krinkle: "Please download a correctly sized rendering of the SVG logo in both 1x and 2x size, run through an optimiser (e.g. zopflipng, or ImageOpti" [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[02:30:58] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.10) (duration: 09m 33s)
[02:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:53:39] Operations, Commons, MediaWiki-Page-deletion, media-storage, and 4 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2482696 (aaron) Open>Resolved According to [[ https://logstash.wikimedia....
[02:56:19] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.11) (duration: 08m 57s)
[02:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:03:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jul 21 03:03:21 UTC 2016 (duration 7m 2s)
[03:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:11:25] (CR) Dzahn: [C: +1] Add Bryan to labtest roots. [puppet] - https://gerrit.wikimedia.org/r/299959 (https://phabricator.wikimedia.org/T140830) (owner: Gehel)
[03:13:49] Operations, Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2441876 (Dzahn) 2 of 3 users are good to go now. We just need a wikitech user for "bcohn" to finalize this.
[03:29:39] Operations, ops-eqiad, fundraising-tech-ops: decommission aluminium, replace it with frqueue1002 - https://phabricator.wikimedia.org/T140676#2472440 (Dzahn) already removed from DNS in July 2015 and don't see anything in puppet either. --- commit 4c46ff39f1071816d8ed865d93d66daf3b3fc929 Author: jgr...
[03:31:16] Operations, ops-eqiad, fundraising-tech-ops: decommission aluminium, replace it with frqueue1002 - https://phabricator.wikimedia.org/T140676#2482722 (Dzahn) only mgmt dns is left, since cables have been removed.. we can remove that too
[03:32:48] Operations, ops-eqiad, fundraising-tech-ops: decommission aluminium, replace it with frqueue1002 - https://phabricator.wikimedia.org/T140676#2482723 (Dzahn) oh wait, you mean "aluminium.**frack.**eqiad.wmnet" (too) right
[03:38:41] /wmf/dns$ git rebase --continue
[03:38:41] fatal: update_ref failed for ref 'refs/heads/master': cannot lock ref 'refs/heads/master': ref refs/heads/master is at afda28cb6ce31bf058b662cc352fef91029ab921 but expected b49609114be919f8129ac6c464dfcfdbc56c61f3
[03:38:45] Successfully rebased and updated refs/heads/master.
[03:38:50] fatal AND successful.. yay
[03:39:21] welcome to git!
[03:39:24] http://latkin.org/blog/2016/07/20/git-for-windows-accidentally-creates-ntfs-alternate-data-streams/
[03:42:35] MaxSem: lolwut
[03:44:05] MaxSem: correction: lol:wut :)
[03:47:45] (PS1) Dzahn: remove all aluminum/aluminium remnants [dns] - https://gerrit.wikimedia.org/r/300213 (https://phabricator.wikimedia.org/T140676)
[03:48:17] (PS2) Dzahn: remove all aluminum/aluminium remnants [dns] - https://gerrit.wikimedia.org/r/300213 (https://phabricator.wikimedia.org/T140676)
[03:49:51] wanted to add Jeff as reviewer in gerrit, typing J.., waiting for autocomplete, hit enter, but i got "JavaScript" instead.. that adds like 20 unrelated people at once .. oops :)
[04:02:07] Operations, Discovery, Labs, Labs-Infrastructure, and 2 others: Update coastline data in OSM postgres db (osmdb.eqiad.wmnet) - https://phabricator.wikimedia.org/T140296#2459518 (Dzahn) just some technical notes: osmdb.eqiad.wmnet is an alias for labsdb1006.eqiad.wmnet cheat sheet for shp2pgs...
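An aside on the contradictory "fatal AND successful" rebase output above: git finalizes a rebase with a compare-and-swap `update_ref`, which refuses to move a branch whose ref no longer points at the commit it expected (e.g. because another process touched it mid-rebase). A minimal sketch of that behaviour in a throwaway repo (the repo, branch name, and commit messages are made up for illustration; this is not the /wmf/dns checkout):

```shell
# Demonstrate git's compare-and-swap ref update in a scratch repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.org commit -q --allow-empty -m first
old=$(git rev-parse HEAD)
git -c user.name=demo -c user.email=demo@example.org commit -q --allow-empty -m second
new=$(git rev-parse HEAD)
git update-ref refs/heads/demo "$old"            # create the ref at $old
git update-ref refs/heads/demo "$new" "$old"     # CAS: ref is at $old as expected, so it moves
# This CAS fails: the ref is now at $new, not the stale $old we claim to expect --
# the same "is at X but expected Y" failure seen in the log.
git update-ref refs/heads/demo "$old" "$old" 2>/dev/null || echo "stale expected value rejected"
```

The third form of `git update-ref <ref> <newvalue> <oldvalue>` only succeeds if the ref currently equals `<oldvalue>`, which is exactly the check that produced the "cannot lock ref" message above.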
[04:19:42] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2482774 (Dzahn) copying verbatim comment from @Glaisher on T134017#2253719 --- Could someone provide the translations for the namespace names? If po...
[04:31:42] (PS1) Dzahn: restbase: add new tcy.wikipedia [puppet] - https://gerrit.wikimedia.org/r/300214 (https://phabricator.wikimedia.org/T140898)
[04:35:17] (PS1) Dzahn: labs dnsrecursor: add tcy.wiki(pedia) [puppet] - https://gerrit.wikimedia.org/r/300215 (https://phabricator.wikimedia.org/T140898)
[04:36:16] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2482807 (Dzahn)
[05:52:02] (today I'll be afk :)
[06:07:28] (PS2) Dzahn: admin: add shell account for Jasmeet Samra [puppet] - https://gerrit.wikimedia.org/r/300196 (https://phabricator.wikimedia.org/T140445)
[06:18:25] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2482825 (Dzahn) @AlexMonk-WMF is an Interwiki cache update like https://gerrit.wikimedia.org/r/#/c/286552/1 needed for this as well?
[06:30:34] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:34] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: puppet fail
[06:30:34] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:34] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:32] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:33] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:38] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2482840 (Dzahn) @Aude could we have a change like https://gerrit.wikimedia.org/r/#/c/288097/4 for "tcy"?
[06:31:43] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:45] Operations, Puppet, Labs, Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2441209 (greg) rOPUP:modules/toollabs/manifests/dev_environ.pp already has differences for what is installed and not just version, but software themselves (eg...
[06:34:18] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2482844 (Dzahn) also needed: - messages (https://gerrit.wikimedia.org/r/#/c/286556/) - database replica labs (DBA)
[06:34:43] (PS1) ArielGlenn: fix up xmlstubs batch jobs setting for en wiki xml dumps [puppet] - https://gerrit.wikimedia.org/r/300224 (https://phabricator.wikimedia.org/T132279)
[06:36:57] (CR) ArielGlenn: [C: +2] fix up xmlstubs batch jobs setting for en wiki xml dumps [puppet] - https://gerrit.wikimedia.org/r/300224 (https://phabricator.wikimedia.org/T132279) (owner: ArielGlenn)
[06:42:53] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:50:29] _joe_: I'm seeing error: RPC failed; result=22, HTTP code = 503
[06:50:29] fatal: The remote end hung up unexpectedly
[06:50:29] for both strontium and rhodium on puppet-merge from palladium
[06:50:32] any ideas?
[06:50:50] I get the same hangup when I try puppet-merge from strontium, takes quite a while to fail in both cases
[06:51:15] <_joe_> apergos: no idea, that's clearly not related to my past work on puppet
[06:51:32] <_joe_> seems like gerrit issues tbh
[06:51:52] * apergos grumbles some
[06:51:59] <_joe_> I would look at what git does in puppet-merge
[06:52:09] <_joe_> then run it with at least GIT_TRACE=1
[06:52:23] <_joe_> sorry, gotta go run an errand in 2 minutes
[06:52:43] see ya
[06:53:57] <_joe_> apergos: yeah it seems it's gerrit
[06:55:53] PROBLEM - Unmerged changes on repository puppet on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:56:03] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:56:03] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
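A side note on _joe_'s debugging suggestion above: `GIT_TRACE=1` makes git print each built-in and sub-process as it runs, which narrows a mid-fetch 503 down to the transport phase that dies; `GIT_CURL_VERBOSE=1` additionally dumps the HTTP exchange for smart-HTTP remotes. A sketch (the fetch URL below is a placeholder, not the real gerrit remote used by puppet-merge):

```shell
# Harmless local demonstration: trace lines go to stderr.
GIT_TRACE=1 git version 2>&1 | grep "trace:"

# Against a failing remote, the equivalent of the fetch puppet-merge does
# (placeholder URL -- substitute the actual remote):
# GIT_TRACE=1 GIT_CURL_VERBOSE=1 git fetch https://gerrit.example.org/r/operations/puppet production
```

With the trace enabled, a server-side failure like the one in the log shows up as the `http-fetch`/pack negotiation step aborting, which is what pointed at gerrit rather than the puppetmasters here.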
[06:56:04] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:57:03] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:13] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:53] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:04:12] PROBLEM - Unmerged changes on repository puppet on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:04:13] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:04:22] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:05:43] apergos: getting the same in beta code updates: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/113762/console
[07:05:50] nice
[07:05:52] see 6:58
[07:06:13] I was just about to report a bug, but I should sleep (it's 00:06 here), can you?
[07:07:16] still pulling from gerrit right?
[07:07:28] not going to bug report it, going to try to kick it somehow and fix the issue
[07:07:38] we can't have no puppet changes going in today, that's no good
[07:07:43] go sleep, greg-g
[07:08:03] touché, see, I'm sleepy
[07:08:06] thanks
[07:33:27] Hi Operations. Can someone explain that task to me, and why it is important for a general audience to know about it? https://phabricator.wikimedia.org/T86096
[07:38:05] (CR) KartikMistry: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/300193 (owner: MaxSem)
[07:42:23] <_joe_> Trizek: I guess you mean why it was tagged "user-notice"?
[07:42:49] Yes _joe_.
[07:43:10] <_joe_> when we changed the version of the ICU library hhvm is linked against, that changed the way some pages were rendered until we ran a script
[07:43:15] I need to understand what it is about to see how to include it in Tech News.
[07:43:16] <_joe_> so users would notice the issue
[07:43:30] <_joe_> Trizek: it happened 2 months ago or so?
[07:43:38] You are speaking Klingon to me, I'm afraid :)
[07:44:30] <_joe_> Trizek: so, one typical effect was https://phabricator.wikimedia.org/T136281
[07:44:41] <_joe_> we upgraded the application, and then had to run a script
[07:45:03] <_joe_> until that script finished running, some issues were observable on the wikis
[07:45:55] <_joe_> but again, all of that finished in May
[07:46:03] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[07:46:03] So this is now fixed?
[07:46:05] <_joe_> it was two months ago
[07:46:07] <_joe_> Trizek: yes
[07:46:44] So basically, it doesn't need to be announced.
[07:46:47] <_joe_> isn't the ticket resolved?
[07:46:51] <_joe_> yes, no need
[07:46:52] RECOVERY - Unmerged changes on repository puppet on labcontrol1001 is OK: No changes to merge.
[07:47:01] !log restarted gerrit on ytterbium, it was refusing to complete git fetches for large repos (mw core, puppet...)
[07:47:03] RECOVERY - Unmerged changes on repository puppet on labcontrol1002 is OK: No changes to merge.
[07:47:03] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[07:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:47:19] Operations, HHVM, Patch-For-Review: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2482909 (Trizek-WMF)
[07:47:38] Thanks a lot for your explanations, _joe_!
[07:48:02] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge.
[07:48:34] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:51:40] <_joe_> Trizek: you're welcome :)
[07:54:42] <_joe_> I might break puppet in a bit
[07:54:50] <_joe_> as in breaking the puppetmaster
[07:55:02] <_joe_> uhm grrrt-wm is off as well
[07:55:05] <_joe_> let's kick it
[07:58:34] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 31 probes of 399 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[07:58:45] (CR) Mobrovac: [C: +1] restbase: add new tcy.wikipedia [puppet] - https://gerrit.wikimedia.org/r/300214 (https://phabricator.wikimedia.org/T140898) (owner: Dzahn)
[08:03:05] (CR) Giuseppe Lavagetto: [C: +2] puppetmaster: declare NameVirtualHost where expected [puppet] - https://gerrit.wikimedia.org/r/299752 (owner: Giuseppe Lavagetto)
[08:04:04] <_joe_> some puppet failures will be inevitable
[08:04:33] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 399 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[08:10:33] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch, Wikimedia-Logstash: logstash - cron failing to optimize indices - https://phabricator.wikimedia.org/T140973#2482928 (Gehel)
[08:10:39] <_joe_> !log restarting apache on palladium
[08:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:11:53] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:14:03] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:14:52] PROBLEM - puppet last run on mw2112 is CRITICAL: CRITICAL: puppet fail
[08:15:02] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: puppet fail
[08:15:03] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: puppet fail
[08:15:03] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 7 failures
[08:15:13] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: puppet fail
[08:15:13] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: puppet fail
[08:15:23] PROBLEM - puppet last run on mw1159 is CRITICAL: CRITICAL: Puppet has 6 failures
[08:15:23] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Puppet has 17 failures
[08:15:32] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 6 failures
[08:15:33] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 3 failures
[08:15:33] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: puppet fail
[08:15:33] PROBLEM - puppet last run on mw2194 is CRITICAL: CRITICAL: Puppet has 10 failures
[08:15:34] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: puppet fail
[08:15:42] PROBLEM - puppet last run on mc1009 is CRITICAL: CRITICAL: puppet fail
[08:15:43] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: puppet fail
[08:15:43] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: puppet fail
[08:15:52] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Puppet has 9 failures
[08:15:54] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: Puppet has 6 failures
[08:16:03] PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: Puppet has 8 failures
[08:16:12] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: puppet fail
[08:16:12] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 35 failures
[08:16:12] PROBLEM - puppet last run on es2003 is CRITICAL: CRITICAL: puppet fail
[08:16:13] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: puppet fail
[08:16:13] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: puppet fail
[08:16:22] PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: Puppet has 10 failures
[08:16:22] PROBLEM - puppet last run on mw1280 is CRITICAL: CRITICAL: puppet fail
[08:16:23] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: Puppet has 7 failures
[08:16:23] PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: Puppet has 9 failures
[08:16:32] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: puppet fail
[08:16:32] PROBLEM - puppet last run on mw1283 is CRITICAL: CRITICAL: Puppet has 11 failures
[08:16:33] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: puppet fail
[08:16:42] PROBLEM - puppet last run on mw2065 is CRITICAL: CRITICAL: Puppet has 4 failures
[08:16:44] PROBLEM - puppet last run on mw2189 is CRITICAL: CRITICAL: Puppet has 6 failures
[08:16:44] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Puppet has 6 failures
[08:16:52] PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: Puppet has 13 failures
[08:17:02] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 9 failures
[08:17:13] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: Puppet has 11 failures
[08:17:15] <_joe_> expected, I restarted apache on palladium
[08:17:33] PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: Puppet has 10 failures
[08:17:33] PROBLEM - puppet last run on mw2197 is CRITICAL: CRITICAL: Puppet has 9 failures
[08:17:43] PROBLEM - puppet last run on mw2103 is CRITICAL: CRITICAL: Puppet has 6 failures
[08:17:44] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 6 failures
[08:21:45] Operations, ops-eqiad, DBA: dbstore1002 disk errors - https://phabricator.wikimedia.org/T140337#2482951 (jcrespo) I had that very same problem with the old disk, but I assumed it was because it had failed. :-( Let me see if I see anything else bad.
[08:24:52] Operations, Beta-Cluster-Infrastructure, Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#2482954 (hashar) There is still role::parsoid::beta left over. We probably want to audit what is left in puppet.git but afaik there is nothing left to do.
[08:37:43] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[08:40:25] Blocked-on-Operations, Puppet, Beta-Cluster-Infrastructure, Labs, Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2482979 (hashar) I haven't seen that occurr...
[08:40:53] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[08:41:13] RECOVERY - puppet last run on mw1159 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[08:41:22] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[08:41:32] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[08:41:32] RECOVERY - puppet last run on mw2194 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:41:33] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:41:33] RECOVERY - puppet last run on mc1009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[08:41:42] RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[08:41:43] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[08:41:43] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[08:41:53] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[08:42:02] RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:03] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[08:42:03] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[08:42:12] RECOVERY - puppet last run on es2003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[08:42:13] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:13] RECOVERY - puppet last run on mw1280 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[08:42:22] RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:42:22] RECOVERY - puppet last run on mw2179 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[08:42:23] RECOVERY - puppet last run on mw1283 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[08:42:23] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:32] RECOVERY - puppet last run on rdb2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:32] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:42] RECOVERY - puppet last run on mw2065 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[08:42:52] RECOVERY - puppet last run on mw2189 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[08:42:52] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:54] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:54] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:42:54] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:03] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:03] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:04] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:12] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:33] RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:33] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[08:43:33] RECOVERY - puppet last run on mw2197 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:33] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[08:43:42] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:43] RECOVERY - puppet last run on mw2103 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:44:03] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[08:44:14] ACKNOWLEDGEMENT - HP RAID on ms-be1027 is CRITICAL: CRITICAL: Slot 3: Failed: 2I:4:1, 2I:4:2 - OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor Filippo Giunchedi waiting on replacement/diagnose, T140374
[08:44:14] ACKNOWLEDGEMENT - MD RAID on ms-be1027 is CRITICAL: CRITICAL: Active: 11, Working: 11, Failed: 1, Spare: 0 Filippo Giunchedi waiting on replacement/diagnose, T140374
[08:44:43] RECOVERY - puppet last run on mw2112 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:47:28] Operations: reinstall snapshot1001.eqiad.wmnet with RAID, decomm snapshot1002,3,4 - https://phabricator.wikimedia.org/T140439#2483012 (ArielGlenn) a: ArielGlenn
[08:47:50] Operations: reinstall snapshot1001.eqiad.wmnet with RAID, decomm snapshot1002,3,4 - https://phabricator.wikimedia.org/T140439#2464872 (ArielGlenn)
[08:48:43] Operations, Ops-Access-Requests: Requesting access to text caches for andyrussg - https://phabricator.wikimedia.org/T140958#2483032 (Gehel) p: Triage>Normal @BBlack, @ema: varnish is your domain, any opinion on this request for access? It seems that currently access to cp* servers is fairly restri...
[08:50:44] RECOVERY - Disk space on ms-be3004 is OK: DISK OK
[08:52:48] Operations, Monitoring, Release-Engineering-Team: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2483038 (Gehel) p: Triage>Low Triaging this as low priority to match T117470.
[08:54:27] (CR) Nikerabbit: [C: +1] Labs: remove wmgUseContentTranslation - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300193 (owner: MaxSem)
[08:54:58] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch, Wikimedia-Logstash: logstash - cron failing to optimize indices - https://phabricator.wikimedia.org/T140973#2483042 (Gehel) p: Triage>Normal
[08:55:35] (CR) Filippo Giunchedi: [C: +1] Disable `streaming_socket_timeout_in_ms` setting [puppet] - https://gerrit.wikimedia.org/r/300059 (https://phabricator.wikimedia.org/T134016) (owner: Eevans)
[09:00:48] (CR) Filippo Giunchedi: [C: +1] Logstash_checker script for canary deploys [puppet] - https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: GWicke)
[09:01:23] (PS1) Giuseppe Lavagetto: puppetmaster: fix test vhost proxy auth [puppet] - https://gerrit.wikimedia.org/r/300234
[09:04:15] (CR) Giuseppe Lavagetto: [C: +2] puppetmaster: fix test vhost proxy auth [puppet] - https://gerrit.wikimedia.org/r/300234 (owner: Giuseppe Lavagetto)
[09:04:20] Operations, Labs, Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2483051 (fgiunchedi) serpens still shows some memory growth, possibly not fixed yet {F4293977}
[09:05:23] PROBLEM - puppet last run on wtp1002 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago
[09:07:22] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:08:42] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[09:12:37] Operations, ops-codfw, media-storage: ms-be2017 failed disk - https://phabricator.wikimedia.org/T140948#2483058 (fgiunchedi) Open>Invalid I'm not seeing the errors reported in icinga for ms-be2027, I think this was ms-be1027 i.e. {T140374}
[09:13:14] PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:13:27] (CR) MarcoAurelio: [C: -1] "If Visual Editor is to be enabled there, then the wiki should be added to dblists/visualeditordefault.dblist I think." [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: Paladox)
[09:14:03] PROBLEM - puppet last run on mw2190 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:16:19] (PS3) Gehel: Configure new relevance forge servers [puppet] - https://gerrit.wikimedia.org/r/299865 (https://phabricator.wikimedia.org/T137256)
[09:19:13] (CR) Gehel: [C: +2] "Reviewed with Erik, LVS will come as a second step. Looks good otherwise." [puppet] - https://gerrit.wikimedia.org/r/299865 (https://phabricator.wikimedia.org/T137256) (owner: Gehel)
[09:24:33] PROBLEM - puppetmaster backend https on rhodium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 403 Forbidden
[09:25:53] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: puppet fail
[09:26:42] RECOVERY - puppetmaster backend https on rhodium is OK: HTTP OK: Status line output matched 400 - 330 bytes in 0.057 second response time
[09:27:04] ^relforge is me... checking ...
[09:31:36] (03PS1) 10Filippo Giunchedi: add thumbor service IPs [dns] - 10https://gerrit.wikimedia.org/r/300240 (https://phabricator.wikimedia.org/T139606) [09:33:03] PROBLEM - puppetmaster backend https on rhodium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 403 Forbidden [09:36:14] (03PS1) 10Gehel: Adding rack information for new relforge servers [puppet] - 10https://gerrit.wikimedia.org/r/300241 (https://phabricator.wikimedia.org/T137256) [09:38:33] RECOVERY - puppet last run on mw2190 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:39:44] RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:39:48] 06Operations, 10Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#2483113 (10ArielGlenn) How does http://dumps.wikimedia.your.org/ perform? I can ask them about their routing but I know all requests come to and are served from a h... [09:42:29] (03CR) 10Gehel: [C: 032] Adding rack information for new relforge servers [puppet] - 10https://gerrit.wikimedia.org/r/300241 (https://phabricator.wikimedia.org/T137256) (owner: 10Gehel) [09:44:10] 06Operations, 10MediaWiki-General-or-Unknown: 503 error raises again while trying to load a Wikidata page - https://phabricator.wikimedia.org/T140879#2483121 (10abian) Today, this 503 error raises again with the corresponding URL (different diff and different oldid, but the same page)... https://www.wikidata.... 
[09:47:51] !log reinstalling and configuring relforge1001/1002 - T137256 [09:47:52] T137256: Setup two node elasticsearch cluster on relforge1001-1002 - https://phabricator.wikimedia.org/T137256 [09:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:49:38] (03PS3) 10Addshore: RevisionSlider enables: dewiki, hewiki, arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298933 (https://phabricator.wikimedia.org/T140232) [09:51:15] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2483131 (10jcrespo) >>! In T140898#2482844, @Dzahn wrote: > also needed: > > - messages (https://gerrit.wikimedia.org/r/#/c/286556/) > > - database re... [09:53:18] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2483135 (10jcrespo) I cannot do the first until the database is created. The second depend on this. [09:55:02] (03PS1) 10Giuseppe Lavagetto: puppetmaster: Apache 2.4/jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300242 [10:16:30] (03PS1) 10Filippo Giunchedi: lvs: add thumbor to lvs [puppet] - 10https://gerrit.wikimedia.org/r/300244 (https://phabricator.wikimedia.org/T139606) [10:23:10] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM for now, but please see my comment." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300244 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [10:24:03] (03PS2) 10Giuseppe Lavagetto: puppetmaster: Apache 2.4/jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300242 [10:29:16] (03PS1) 10ArielGlenn: fix link to current set of cirrus search dumps [puppet] - 10https://gerrit.wikimedia.org/r/300246 (https://phabricator.wikimedia.org/T138176) [10:35:07] !log cr2-eqiad: increase cross-datacenter link OSPF metrics [10:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:37:27] (03PS3) 10Giuseppe Lavagetto: puppetmaster: Apache 2.4/jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300242 [10:38:25] (03PS2) 10ArielGlenn: fix link to current set of cirrus search dumps [puppet] - 10https://gerrit.wikimedia.org/r/300246 (https://phabricator.wikimedia.org/T138176) [10:40:50] (03CR) 10ArielGlenn: [C: 032] fix link to current set of cirrus search dumps [puppet] - 10https://gerrit.wikimedia.org/r/300246 (https://phabricator.wikimedia.org/T138176) (owner: 10ArielGlenn) [10:50:43] !log cr2-eqiad: deactivating IX BGP sessions [10:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:52:50] !log cr2-eqiad: deactivating Transit BGP sessions [10:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:53:51] (03PS4) 10Giuseppe Lavagetto: puppetmaster: Apache 2.4/jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300242 [10:55:12] !log cr2-eqiad: deactivating Fundraising BGP session [10:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:59:54] !log cr2-eqiad: disabling IX/Transit/Fundraising interfaces [10:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:00:04] paravoid: Dear anthropoid, the time has come. 
Please deploy network maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160721T1100). [11:00:38] PROBLEM - Host lutetium is DOWN: PING CRITICAL - Packet loss = 100% [11:01:08] <_joe_> I guess this is expected paravoid [11:01:09] is that expected? [11:03:06] it's not :/ [11:03:15] just one frack host? [11:03:20] <_joe_> seems so [11:03:45] mismatch of some acl or so? [11:03:49] otoh, acls are mostly on the SRX [11:03:57] I can ping it from neon, weird [11:05:39] ah, it's its public IP, 208.80.155.13 [11:05:45] not lutetium.frack.eqiad.wmnet (10.64.40.111) [11:05:46] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 106, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-15/0/0: down - cr2-eqiad:xe-5/0/3BR [11:10:03] I don't see it [11:10:11] all looks good really [11:10:34] (03PS5) 10Giuseppe Lavagetto: puppetmaster: Apache 2.4/jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300242 [11:11:00] i can't even login on pfw1 [11:11:14] probably some stupid acl [11:12:08] pretty sure it's just some NAT stupidity on the SRX [11:12:15] probably [11:13:22] lutetium sees the packet and replies [11:13:26] 11:13:23.946346 IP 10.64.40.111 > 208.80.154.14: ICMP echo reply, id 22481, seq 15, length 64 [11:14:23] all the rest works [11:14:31] I'll proceed with the cr2-eqiad window [11:14:35] ok [11:14:55] (03PS2) 10Giuseppe Lavagetto: Change-Prop: Fix error ignoring config bug [puppet] - 10https://gerrit.wikimedia.org/r/300166 (owner: 10Ppchelko) [11:15:10] !log cr2-eqiad: deactivate chassis redundancy graceful-switchover [11:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:15:41] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Change-Prop: Fix error ignoring config bug [puppet] - 10https://gerrit.wikimedia.org/r/300166 (owner: 10Ppchelko) [11:15:56] (03PS2) 10Giuseppe Lavagetto: Change-prop: Ignore bot edits on ORES precache updates. 
[puppet] - 10https://gerrit.wikimedia.org/r/300108 (owner: 10Ppchelko) [11:16:06] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Change-prop: Ignore bot edits on ORES precache updates. [puppet] - 10https://gerrit.wikimedia.org/r/300108 (owner: 10Ppchelko) [11:17:10] <_joe_> mobrovac: running puppet on scb* [11:17:26] kk, i'll restart afterwards [11:17:41] <_joe_> puppet has run [11:24:43] !log upgrading cr2-eqiad:re0 and rebooting [11:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:27:16] 06Operations, 10Monitoring, 13Patch-For-Review: diamond: certain counters always calculated as 0 - https://phabricator.wikimedia.org/T138758#2483193 (10ema) @elukey : that's right, we're simply sending gauges instead of counters but the behavior of `Collector.derivative()` still needs to be investigated. [11:28:05] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.4.13:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.4.13, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [11:28:16] PROBLEM - configured eth on relforge1001 is CRITICAL: Connection refused by host [11:28:16] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.37.21:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.21, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [11:28:17] PROBLEM - Elasticsearch HTTPS on relforge1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [11:28:25] PROBLEM - MD RAID on relforge1001 is CRITICAL: Connection refused by host [11:28:25] PROBLEM - salt-minion processes on relforge1001 is CRITICAL: Connection refused by host [11:28:37] PROBLEM - Elasticsearch HTTPS on
relforge1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [11:28:45] PROBLEM - NTP on relforge1001 is CRITICAL: NTP CRITICAL: No response from NTP server [11:28:56] PROBLEM - dhclient process on relforge1001 is CRITICAL: Connection refused by host [11:29:06] PROBLEM - Check size of conntrack table on relforge1001 is CRITICAL: Connection refused by host [11:29:27] relforge? [11:29:36] PROBLEM - Disk space on relforge1001 is CRITICAL: Connection refused by host [11:29:45] PROBLEM - DPKG on relforge1001 is CRITICAL: Connection refused by host [11:32:59] looks like a new host, silenced [11:33:56] (03PS1) 10Ema: cache_upload: do not set Access-Control-Allow-Origin twice [puppet] - 10https://gerrit.wikimedia.org/r/300249 [11:34:07] (03CR) 10Mobrovac: [C: 031] Labs: remove wmgUseEventBus - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300192 (owner: 10MaxSem) [11:38:27] !log cr2-eqiad: toggling mastership between routing-engines (re1->re0) [11:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:40:46] PROBLEM - Host cr2-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.197) [11:43:05] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 106, down: 0, dormant: 0, excluded: 1, unused: 0 [11:43:57] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [11:44:16] RECOVERY - Host cr2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [11:44:31] 06Operations, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#2483220 (10mobrovac) https://gerrit.wikimedia.org/r/#/c/300067/ addresses this. Will amend the commit to link it to this bug too. 
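ema's note earlier in this window about `Collector.derivative()` (T138758, "certain counters always calculated as 0") points at a classic counter-handling pitfall: a derivative needs a previous sample, so the first reported value is always 0. The sketch below is an illustrative reimplementation of that behavior, not diamond's actual code:

```python
# Illustrative sketch of derivative() semantics for counter metrics
# (not diamond's actual implementation): the first sample has no
# baseline, so the computed rate is 0 -- which is one way a counter
# can end up "always calculated as 0" if state is lost between runs.

class DerivativeTracker:
    def __init__(self):
        self.last_values = {}

    def derivative(self, name, new_value, interval=1):
        """Return the per-interval rate of change for a counter."""
        old_value = self.last_values.get(name)
        self.last_values[name] = new_value
        if old_value is None:
            # No previous reading yet: first sample always yields 0.
            return 0
        return (new_value - old_value) / float(interval)

tracker = DerivativeTracker()
print(tracker.derivative("requests", 100))   # first sample -> 0
print(tracker.derivative("requests", 160))   # (160 - 100) / 1 -> 60.0
```

Sending gauges instead of counters, as the comment describes, sidesteps this logic entirely because the raw value is shipped without differencing.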
[11:45:09] (03PS2) 10Mobrovac: Parsoid: clean up the manifests and files [puppet] - 10https://gerrit.wikimedia.org/r/300067 (https://phabricator.wikimedia.org/T90668) [11:45:15] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [11:49:16] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 106, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-15/0/0: down - cr2-eqiad:xe-5/0/3BR [11:49:16] PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 1 yellow alarms [11:49:33] !log upgrading cr2-eqiad:re1 and rebooting [11:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:51:35] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 97 probes of 243 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:53:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:54:17] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:57:36] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 15 probes of 243 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:59:07] 07Blocked-on-Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs, 13Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2206880 (10AlexMonk-WMF) I saw it just a few... [12:03:27] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480071 (10AlexMonk-WMF) >>! In T140898#2482825, @Dzahn wrote: > @AlexMonk-WMF is an Interwiki cache update like https://gerrit.wikimedia.org/r/#/c/2865... 
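The cr2-eqiad drain steps being !logged through this maintenance window (raising cross-datacenter OSPF metrics, deactivating IX/Transit/Fundraising BGP sessions, disabling edge interfaces, toggling routing-engine mastership) map onto standard Junos commands. The sketch below is illustrative only: the group names, interface names, and metric value are hypothetical, not the actual production configuration:

```
## Configuration mode -- names and values are illustrative:
set protocols ospf area 0.0.0.0 interface ae0.0 metric 1000   # de-prefer cross-DC link
deactivate protocols bgp group IX                             # drop peering sessions
deactivate protocols bgp group Transit
set interfaces xe-5/0/3 disable                               # take edge link down
commit comment "drain cr2-eqiad for maintenance"

## Operational mode, once traffic has converged onto cr1-eqiad:
request chassis routing-engine master switch                  # toggle re0 <-> re1 mastership
```

Deactivated statements stay in the configuration, so re-enabling after the maintenance is a matter of `activate` plus `commit` rather than re-entering the sessions.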
[12:07:09] PROBLEM - puppet last run on mw2216 is CRITICAL: CRITICAL: puppet fail [12:07:58] logstash/kibana is not loading [12:08:48] RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [12:08:57] Oops! Looks like something went wrong. Refreshing may do the trick. [12:08:59] but it doesn't [12:10:10] OK, now it does [12:14:09] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 15User-mobrovac: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2483251 (10mobrovac) [12:15:19] !log cr2-eqiad: setting "chassis network-services enhanced-ip" and rebooting re1 (then re0 will follow) [12:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:19:12] !log cr2-eqiad: toggling mastership between routing-engines (re0->re1) [12:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:21:47] PROBLEM - Host cr2-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.197) [12:23:58] RECOVERY - Host cr2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.61 ms [12:24:33] PROBLEM - Host text-lb.eqiad.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ed1a::1 [12:24:47] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [12:24:53] ugh [12:25:27] just heavy packet loss on IPv6 [12:25:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [12:26:06] RECOVERY - Host text-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 1.97 ms [12:26:47] Has there been an increase in frequency of "readonly" states on Wikimedia sites lately? [12:26:59] no [12:27:19] !log cr2-eqiad: rebooting backup RE (re0) [12:27:23] I am checking what prevents people from publishing articles using Content Translation, and recently "readonly" has been very common.
[12:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:27:35] Let me show what exactly I mean by "readonly": [12:27:39] not a good time now, aharoni [12:27:48] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [12:28:14] OK :) [12:28:28] sorry, in the middle of a complicated upgrade [12:30:29] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [12:30:36] 06Operations, 10ops-eqiad, 10hardware-requests: decommission WMF3155-WMF3175 (old lsearchd) - https://phabricator.wikimedia.org/T140372#2483260 (10Cmjohnson) [12:31:18] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 65 probes of 243 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:31:36] 06Operations, 10hardware-requests: eqiad out of warranty spares to decommission - approval request - https://phabricator.wikimedia.org/T120679#2483262 (10Cmjohnson) [12:31:38] 06Operations, 10ops-eqiad, 10hardware-requests: decommission WMF3155-WMF3175 (old lsearchd) - https://phabricator.wikimedia.org/T140372#2462603 (10Cmjohnson) 05Open>03Resolved [12:32:08] aharoni: what other channels are you in that are relevant?
:P [12:32:09] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:33:10] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission broken db1058 - https://phabricator.wikimedia.org/T134360#2483265 (10Cmjohnson) 05Open>03Resolved db1058 has been removed from rack [12:34:17] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [12:35:06] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:35:06] 06Operations, 10ops-eqiad, 06DC-Ops: Decommission cp1037-1040 - https://phabricator.wikimedia.org/T83553#2483267 (10Cmjohnson) 05Open>03Resolved This was completed...all servers have been removed from racks and decommissioned. [12:35:16] RECOVERY - puppet last run on mw2216 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:36:07] !log change-prop deploying b7079fd9c [12:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:37:26] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 14 probes of 243 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:37:47] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:39:01] gehel: long time no see :) [12:39:08] can not connect to production now :( [12:39:16] zeljkof: that much? 
[12:39:29] ssh config https://github.com/zeljkofilipin/dotfiles/blob/master/.ssh/config [12:40:10] a couple of terminal outputs [12:40:11] https://phabricator.wikimedia.org/P3534 [12:40:16] https://phabricator.wikimedia.org/P3535 [12:41:09] !log citoid deployed 5134e49e [12:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:41:47] <_joe_> zeljkof: https://github.com/zeljkofilipin/dotfiles/blob/master/.ssh/config#L20 [12:41:55] zeljkof: could it be that your ssh key also needs to be updated? [12:41:59] zeljkof: https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml#L1555 [12:42:19] zeljkof: I need to run to the doctor in a few minutes... [12:42:28] <_joe_> zeljkof: ProxyCommand None [12:42:32] gehel: :) I think the key is fine now, but I will double check [12:42:33] <_joe_> with no comment afterwards [12:42:42] <_joe_> None != none IIRC [12:42:45] yup [12:42:56] _joe_: hashar said the same thing, I have copy/pasted it from docs :| [12:42:58] ssh is sensitive to cases [12:43:07] PROBLEM - Host dbproxy1002 is DOWN: PING CRITICAL - Packet loss = 100% [12:43:09] will fix and try again [12:43:17] that is 'none' [12:43:18] <_joe_> jynus: ^^ [12:43:23] <_joe_> known? [12:43:31] err wrong window sorry [12:43:43] * zeljkof is doing the needful [12:47:44] 06Operations, 10Ops-Access-Requests: Requesting access to text caches for andyrussg - https://phabricator.wikimedia.org/T140958#2483278 (10BBlack) 05Open>03declined Yes, outside of global roots, access to any of the caches is pretty tightly restricted. It's not just based on needs, but also other stabilit... 
[12:48:46] if dbproxy is down, gerrit and otrs are down among others [12:48:53] !log cr2-eqiad: fixing IPv6 VRRP interoperability between the cr1/cr2 ( http://www.juniper.net/documentation/en_US/junos14.2/topics/concept/vrrpv3-junos-support.html ) [12:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:49:08] which means there is an ongoing outage [12:49:11] !log cr2-eqiad: re-enabling GRES and toggling mastership between routing-engines (re1->re0) [12:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:53:48] hashar _joe_ gehel mobrovac: works now! [12:53:59] minor tweaks were needed in the docs https://wikitech.wikimedia.org/w/index.php?title=Production_shell_access&type=revision&diff=773056&oldid=763132 [12:54:06] thanks everybody [12:54:44] (relevant changes are in ssh config, hashar made a few text style changes too) [12:54:57] looks like inline comments in ssh config were causing trouble [12:55:10] entirely my fault [12:55:38] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [12:55:38] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [12:56:50] hashar: so you were the one that added the inline comments? :D [12:56:57] yup [12:57:09] without even testing it / reading the ssh_config doc about comments [12:58:04] !log manually flipping m2-master to db1020 [12:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:01:28] !log bounce gerrit on ytterbium [13:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:01:40] anyone else getting an error page from gerrit?
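The root cause zeljkof, hashar, and _joe_ converged on above is a general ssh_config gotcha: `#` only starts a comment at the beginning of a line, so an inline comment after a value becomes part of that value, and `ProxyCommand` arguments are case-sensitive (the keyword must be the lowercase `none`, not `None`). A minimal sketch of a correct config, with illustrative hostnames rather than the real Wikimedia bastion entries:

```
# Comments in ssh_config must be on their own line; anything after a
# value is parsed as part of that value.

Host bastion.wmflabs.example
    # Connect to the bastion directly -- lowercase "none", no trailing comment
    ProxyCommand none

Host *.eqiad.wmnet *.codfw.wmnet
    # Everything else is tunneled through the bastion
    ProxyCommand ssh -W %h:%p bastion.wmflabs.example
```

Writing `ProxyCommand None  # direct` makes ssh try to execute a proxy command literally named `None  # direct`, which produces exactly the kind of opaque connection failures pasted in P3534/P3535.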
[13:01:57] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:01:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:02:07] stephanebisson: should be gone now [13:02:22] godog: yep, thanks! [13:02:46] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:05:38] heh the gerrit bot doesn't survive gerrit outages apparently, anyways dns change is https://gerrit.wikimedia.org/r/#/c/300254/1 [13:10:00] !log cr2-eqiad: setting "chassis state cb-upgrade on" and powering off re1 (backup) [13:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:10:30] !log cr2-eqiad: setting fabric plane 4 to offline [13:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:10:57] !log cr2-eqiad: setting fabric plane 5/6/7 to offline [13:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:11:27] godog: should merge it now, IMHO [13:11:48] (or we risk blocking on it or accidentally reverting it if we need to make a quick DNS commit during network maint) [13:12:35] godog, yeah, restarting that bot [13:12:35] !log cr2-eqiad: setting scb 1 to offline and replacing it [13:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:12:42] bblack: indeed! 
merged, I'll run authdns-update too [13:13:25] (instructions for it are at https://wikitech.wikimedia.org/wiki/Grrrit-wm#Building.2FDeploying ) [13:13:52] 06Operations, 10ops-eqiad, 10DBA: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2483315 (10jcrespo) [13:13:57] RECOVERY - Host dbproxy1002 is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [13:14:10] 06Operations, 10ops-eqiad, 10DBA: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2483328 (10jcrespo) ``` MariaDB MISC m2 localhost (none) > SHOW DATABASES; +--------------------+ | Database | +--------------------+ | bugzilla_testing | | frimpressions | | heartbeat | |... [13:14:19] Krenair: kk, thanks [13:14:24] ah it is back [13:17:01] (03PS1) 10Yuvipanda: shinken: Use new labs graphite URL [puppet] - 10https://gerrit.wikimedia.org/r/300265 (https://phabricator.wikimedia.org/T140976) [13:18:07] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: Connection refused by host [13:18:17] PROBLEM - configured eth on dbproxy1002 is CRITICAL: Connection refused by host [13:18:21] !log mathoid deploying 36be4ea [13:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:18:28] PROBLEM - dhclient process on dbproxy1002 is CRITICAL: Connection refused by host [13:18:46] PROBLEM - DPKG on dbproxy1002 is CRITICAL: Connection refused by host [13:18:48] PROBLEM - Disk space on dbproxy1002 is CRITICAL: Connection refused by host [13:18:48] PROBLEM - haproxy process on dbproxy1002 is CRITICAL: Connection refused by host [13:18:58] PROBLEM - MD RAID on dbproxy1002 is CRITICAL: Connection refused by host [13:19:07] PROBLEM - puppet last run on dbproxy1002 is CRITICAL: Connection refused by host [13:19:08] PROBLEM - salt-minion processes on dbproxy1002 is CRITICAL: Connection refused by host [13:19:17] PROBLEM - haproxy alive on dbproxy1002 is CRITICAL: Connection refused by host [13:19:17] PROBLEM - MPT RAID on dbproxy1002 is CRITICAL: Connection refused by host 
[13:19:55] YuviPanda: nice, thanks for working on grafana-labs ! did you play with prometheus-tools already? [13:20:15] godog yup, just added it as a data source! [13:20:39] godog but I can't get graphite added, am completing the migration of labs graphite to graphite-labs.wikimedia.org (behind misc varnish now) before trying it again [13:21:44] godog have you played with prometheus expression language? I've a few questions [13:21:51] YuviPanda: a bit yeah [13:23:46] PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 1 yellow alarms [13:24:46] PROBLEM - Host dbproxy1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:07] godog ok, I'll dig around some more and poke you with questions :) [13:26:07] RECOVERY - Host dbproxy1002 is UP: PING OK - Packet loss = 0%, RTA = 2.39 ms [13:26:34] YuviPanda: hehe ok, let me know if you can add graphite too [13:26:50] godog will do! [13:28:47] !log cr2-eqiad: toggling mastership between routing-engines (re0->re1) [13:30:07] !log cr2-eqiad: powering off re0 (backup) [13:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:30:20] did it lose one? 
[13:30:31] SAL says no, weird [13:31:04] (03Abandoned) 10Ema: cache_upload: do not set Access-Control-Allow-Origin twice [puppet] - 10https://gerrit.wikimedia.org/r/300249 (owner: 10Ema) [13:31:16] !log cr2-eqiad: setting fabric plane 0/1/2/3 to offline [13:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:31:42] !log cr2-eqiad: setting scb 0 to offline and replacing it [13:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:37:45] (03CR) 10Ottomata: Move the Analytics Refinery role to scap3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [13:38:31] !log cr2-eqiad: toggling mastership between routing-engines (re1->re0) [13:39:21] (03CR) 10Ottomata: Move the Analytics Refinery role to scap3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [13:40:41] !log cr2-eqiad: fabric upgrade bandwidth for FPC 4/5 [13:41:09] PROBLEM - Disk space on es2001 is CRITICAL: Timeout while attempting connection [13:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:43:00] RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [13:43:49] PROBLEM - Host cr1-eqord is DOWN: PING CRITICAL - Packet loss = 100% [13:43:50] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.110, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:43:50] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.112, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., 
error(111, Connection refused))) [13:43:50] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.80, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:43:51] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned the unexpec [13:43:59] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200) [13:43:59] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned the unexpec [13:44:19] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.149, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:44:41] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.178, port=7231): Max 
retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:44:41] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.134, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:45:00] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [13:45:01] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 504 (exp [13:45:10] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - restbase_7231 - Could not depool server restbase1008.eqiad.wmnet because of too many down! 
[13:45:16] PROBLEM - LVS HTTP IPv4 on restbase.svc.eqiad.wmnet is CRITICAL: Connection refused [13:45:16] PROBLEM - Restbase root url on restbase1010 is CRITICAL: Connection refused [13:45:16] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=restbase.svc.eqiad.wmnet, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by class socket.error: [Errno 111] Connection refused) [13:45:17] PROBLEM - puppet last run on mw2103 is CRITICAL: CRITICAL: Puppet has 1 failures [13:45:19] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [13:45:19] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [13:45:28] <_joe_> wat? [13:45:30] PROBLEM - puppet last run on mw2089 is CRITICAL: CRITICAL: Puppet has 2 failures [13:45:31] PROBLEM - Restbase root url on restbase1015 is CRITICAL: Connection refused [13:45:31] PROBLEM - restbase endpoints health on cerium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.147, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:45:40] PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Puppet has 37 failures [13:45:43] what's happening? [13:45:44] <_joe_> what the hell happened? [13:45:50] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - restbase_7231 - Could not depool server restbase1008.eqiad.wmnet because of too many down! 
[13:45:58] <_joe_> going to take a look on one of the rb machines [13:46:00] RECOVERY - Host cr1-eqord is UP: PING OK - Packet loss = 0%, RTA = 43.53 ms [13:46:00] PROBLEM - puppet last run on druid1003 is CRITICAL: CRITICAL: puppet fail [13:46:00] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Puppet has 36 failures [13:46:01] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [13:46:01] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [13:46:09] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Puppet has 5 failures [13:46:09] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [13:46:11] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: puppet fail [13:46:11] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: puppet fail [13:46:14] cr1-eqord down. er? [13:46:19] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [1000.0] [13:46:19] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: puppet fail [13:46:20] PROBLEM - Restbase root url on restbase1009 is CRITICAL: Connection refused [13:46:21] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - restbase_7231 - Could not depool server restbase1008.eqiad.wmnet because of too many down! 
[13:46:26] no it's not [13:46:29] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: puppet fail [13:46:29] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Puppet has 2 failures [13:46:30] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.133, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:46:30] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.79, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:46:31] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [13:46:31] PROBLEM - restbase endpoints health on xenon is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.200, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:46:31] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200) [13:46:32] <_joe_> this is restbase [13:46:39] PROBLEM - Restbase root url on restbase1013 is CRITICAL: Connection refused [13:46:40] PROBLEM - Restbase root url on cerium is CRITICAL: Connection refused [13:46:50] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 6 failures [13:46:57] (03CR) 10Jgreen: [C: 031] remove all aluminum/aluminium 
remnants [dns] - 10https://gerrit.wikimedia.org/r/300213 (https://phabricator.wikimedia.org/T140676) (owner: 10Dzahn) [13:46:58] (04:43:49 PM) icinga-wm: PROBLEM - Host cr1-eqord is DOWN: PING CRITICAL - Packet loss = 100% icinga thinks/thought so [13:46:59] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 2 failures [13:46:59] PROBLEM - puppet last run on elastic1037 is CRITICAL: CRITICAL: Puppet has 2 failures [13:47:00] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 1 failures [13:47:10] PROBLEM - Restbase root url on restbase1012 is CRITICAL: Connection refused [13:47:10] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Puppet has 5 failures [13:47:10] PROBLEM - puppet last run on mw1159 is CRITICAL: CRITICAL: Puppet has 5 failures [13:47:11] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: Puppet has 3 failures [13:47:11] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Puppet has 5 failures [13:47:11] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Puppet has 39 failures [13:47:11] PROBLEM - puppet last run on mw1188 is CRITICAL: CRITICAL: Puppet has 5 failures [13:47:11] PROBLEM - Restbase root url on restbase1014 is CRITICAL: Connection refused [13:47:12] PROBLEM - Restbase root url on restbase1008 is CRITICAL: Connection refused [13:47:19] PROBLEM - puppet last run on wtp1024 is CRITICAL: CRITICAL: Puppet has 1 failures [13:47:20] PROBLEM - puppet last run on mc2002 is CRITICAL: CRITICAL: Puppet has 3 failures [13:47:20] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Puppet has 29 failures [13:47:21] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Puppet has 1 failures [13:47:21] PROBLEM - puppet last run on mw1115 is CRITICAL: CRITICAL: Puppet has 2 failures [13:47:29] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: Puppet has 5 failures [13:47:29] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Puppet has 3
failures [13:47:29] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: puppet fail [13:47:40] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Puppet has 22 failures [13:47:41] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Puppet has 4 failures [13:47:49] PROBLEM - Redis status tcp_6381 on rdb2004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.16.123 on port 6381 [13:47:50] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 2 failures [13:47:55] <_joe_> shit [13:48:00] PROBLEM - puppet last run on mw1196 is CRITICAL: CRITICAL: Puppet has 7 failures [13:48:00] PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: Puppet has 1 failures [13:48:08] <_joe_> ok for restbase something really strange is happening [13:48:10] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 3 failures [13:48:10] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [13:48:20] what is going on here??? [13:48:22] godog https://grafana-labs-admin.wikimedia.org/dashboard/db/kubernetes-pods :D enter name of any tool in template for stats!
(example: geohack / xtools-articleinfo) [13:48:30] RECOVERY - Restbase root url on restbase1009 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.073 second response time [13:48:30] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 4 failures [13:48:37] at least two independent problems, probably [13:48:39] PROBLEM - Host mr1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:48:39] PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: Puppet has 3 failures [13:48:40] PROBLEM - puppet last run on mc2009 is CRITICAL: CRITICAL: Puppet has 3 failures [13:48:40] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [13:48:41] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [13:49:00] PROBLEM - Restbase root url on praseodymium is CRITICAL: Connection refused [13:49:02] godog however, it doesn't show up in https://grafana-labs.wikimedia.org/dashboard/db/kubernetes-pods [13:49:11] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Puppet has 6 failures [13:49:15] <_joe_> I am looking at restbase [13:49:20] RECOVERY - Restbase root url on restbase1014 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.018 second response time [13:49:21] RECOVERY - Restbase root url on restbase1010 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.008 second response time [13:49:29] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [13:49:29] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [13:49:30] damn, it seems like https://phabricator.wikimedia.org/T136957 mass-happened on RB [13:49:36] damn [13:49:40] RECOVERY - Redis status tcp_6381 on rdb2004 is OK: OK: REDIS on 10.192.16.123:6381 has 1 databases (db0) with 9754281 keys - replication_delay is 0 [13:49:43] * mobrovac restarting RB [13:49:50] PROBLEM - Restbase root url on xenon is CRITICAL: Connection refused [13:49:56] <_joe_> mobrovac:
I am doing it [13:50:06] <_joe_> coordinate [13:50:12] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [13:50:19] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [13:50:19] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [13:50:31] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [13:50:35] k _joe_ [13:50:49] RECOVERY - Restbase root url on restbase1013 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.070 second response time [13:50:50] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [13:51:10] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [13:51:10] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [13:51:20] RECOVERY - Restbase root url on restbase1012 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.013 second response time [13:51:21] RECOVERY - Restbase root url on restbase1008 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.016 second response time [13:51:29] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [13:51:34] <_joe_> done [13:51:37] RECOVERY - LVS HTTP IPv4 on restbase.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.010 second response time [13:51:49] <_joe_> 7 minutes of outage [13:51:51] RECOVERY - Restbase root url on restbase1015 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.017 second response time [13:51:57] <_joe_> just because I didn't trust my guts :/ [13:52:17] <_joe_> mobrovac: about that ticket [13:52:19] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [13:52:25] 06Operations, 10RESTBase, 06Services, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2483364 (10mobrovac) This mass-happened today: ``` (15:43:50) icinga-wm: PROBLEM - 
restbase endpoints health on restbase1009 is CRITICAL: Generic error: Generic connectio... [13:52:27] well there's two problems in the spam above: whatever happened with RB shutdown, and a network blip causing a spam of puppetfail [13:52:31] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [13:52:31] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [13:52:39] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [13:52:39] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [13:52:48] one might have triggered the other [13:53:00] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [13:53:21] PROBLEM - puppet last run on mw2110 is CRITICAL: CRITICAL: Puppet has 1 failures [13:53:25] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1469105590309&to=1469109190309&var-site=eqiad&var-cache_type=%24__all&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5 [13:53:28] (03PS1) 10Rush: check_legal: mobile privacy reference is now explicitly https [puppet] - 10https://gerrit.wikimedia.org/r/300272 [13:53:34] ^ shows the dip in eqiad traffic from the public [13:53:34] <_joe_> mobrovac: I see now restbase doesn't have "Restart: always" [13:53:54] <_joe_> bblack: we clearly had a network issue [13:54:06] <_joe_> and that might have crashed restbase [13:54:17] <_joe_> but the real issue that caused such a long outage is [13:54:45] yes, it's possible that the mysterious RB outages are just hypersensitivity to network blips [13:54:51] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:54:52] <_joe_> the systemd unit having an issue [13:55:00] PROBLEM - puppet last run on mw2062 is CRITICAL: CRITICAL: Puppet has 1 failures [13:55:03] and yeah, systemd should have
some sane service-restart config [13:55:10] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp2009_v4, cp2021_v4 [13:55:30] PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp2018_v4, cp2025_v4 [13:55:30] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2018_v4 [13:55:44] (03CR) 10Rush: [C: 032] check_legal: mobile privacy reference is now explicitly https [puppet] - 10https://gerrit.wikimedia.org/r/300272 (owner: 10Rush) [13:55:59] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp2006_v4, cp2012_v4 [13:56:50] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:56:58] <_joe_> mobrovac: it seems you chose to shoot yourself in the foot with @init_restart = false [13:57:16] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Mpany - https://phabricator.wikimedia.org/T140399#2483375 (10Jgreen) > You will have to configure your ssh client to connect via the bastion hosts to any servers in our interna... 
[13:57:40] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 24 ESP OK [13:57:51] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:59:19] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [13:59:22] (03CR) 10Jgreen: [C: 031] admin: add shell account for Jasmeet Samra [puppet] - 10https://gerrit.wikimedia.org/r/300196 (https://phabricator.wikimedia.org/T140445) (owner: 10Dzahn) [13:59:41] RECOVERY - Restbase root url on praseodymium is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.041 second response time [13:59:47] 06Operations, 10RESTBase, 06Services, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2483376 (10Joe) Any production service running on systemd and not having ``` Restart=always ``` is a large liability as shown by the outage we just experienced. This be... [13:59:49] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 24 ESP OK [14:00:11] (03PS2) 10Yuvipanda: shinken: Use new labs graphite URL [puppet] - 10https://gerrit.wikimedia.org/r/300265 (https://phabricator.wikimedia.org/T140976) [14:00:20] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:00:22] (03CR) 10Yuvipanda: [C: 032 V: 032] shinken: Use new labs graphite URL [puppet] - 10https://gerrit.wikimedia.org/r/300265 (https://phabricator.wikimedia.org/T140976) (owner: 10Yuvipanda) [14:01:30] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has 1 failures [14:02:00] (03PS1) 10Giuseppe Lavagetto: restbase: have systemd restart failed nodes [puppet] - 10https://gerrit.wikimedia.org/r/300275 (https://phabricator.wikimedia.org/T136957) [14:02:08] (03PS1) 10Mobrovac: RESTBase: disable firejail [puppet] - 10https://gerrit.wikimedia.org/r/300276 (https://phabricator.wikimedia.org/T136957) [14:02:18] (03PS2) 10Chad: Remove RevisionSlider from beta's extension-list. 
Already in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300091 [14:02:32] <_joe_> godog, mobrovac since you're the two making the call on not having Restart=always in RB [14:02:40] !log cr2-eqiad: disabling all asw-*-eqiad interfaces [14:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:48] <_joe_> please review https://gerrit.wikimedia.org/r/300275 [14:03:02] RECOVERY - Restbase root url on cerium is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.048 second response time [14:04:09] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We still have no proof it's firejail killing restbase and we have no idea of a root cause." [puppet] - 10https://gerrit.wikimedia.org/r/300276 (https://phabricator.wikimedia.org/T136957) (owner: 10Mobrovac) [14:04:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:04:21] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK [14:04:22] RECOVERY - Restbase root url on xenon is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.016 second response time [14:04:23] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [14:04:31] !log cr2-eqiad: disabling xe-5/2/3 (link to cr2-codfw) [14:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:48] !log cr2-eqiad: disabling xe-4/2/0 (link to cr1-eqord) [14:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:05:16] (03CR) 10Chad: [C: 032] Remove RevisionSlider from beta's extension-list. Already in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300091 (owner: 10Chad) [14:05:23] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2483391 (10mobrovac) >>! 
In T136957#2483376, @Joe wrote: > Any production service running on systemd and not having > > ``` > Restart=always > ``` >... [14:05:43] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [14:05:47] (03Merged) 10jenkins-bot: Remove RevisionSlider from beta's extension-list. Already in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300091 (owner: 10Chad) [14:06:51] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 67, down: 4, dormant: 0, excluded: 0, unused: 0BRae3: down - Core: asw-c-eqiad:ae2BRae4: down - Core: asw-d-eqiad:ae2BRae1: down - Core: asw-a-eqiad:ae2BRae2: down - Core: asw-b-eqiad:ae2BR [14:07:02] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2483406 (10Joe) Crashes happen. We need to be able to survive a mass crash (we can on the appservers precisely because upstart restarts the services... [14:07:07] !log cr2-eqiad: halting both routing engines(!) [14:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:36] <_joe_> mobrovac: seriously, explain to me why it's a good idea not to restart restbase when it stops without a human telling it to stop [14:07:51] RECOVERY - Host mr1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 4.27 ms [14:07:57] <_joe_> because I can't find a good reason not to [14:09:27] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2483407 (10mobrovac) From https://gerrit.wikimedia.org/r/#/c/300276/ by @Joe: > We still have no proof it's firejail killing restbase and we have no... [14:09:51] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [14:09:53] _joe_: why do we want to tolerate a service being killed?
[14:10:04] <_joe_> mobrovac: because we want to serve users? [14:10:11] <_joe_> it's not like you don't get to know it [14:10:13] <_joe_> it's logged [14:10:21] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [14:10:22] RECOVERY - puppet last run on ms-be1012 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:10:32] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [14:10:42] RECOVERY - puppet last run on wtp1024 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:10:45] <_joe_> I mean give me a reason why systemd should not restart rb when it fails [14:10:51] PROBLEM - Host cr2-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.197) [14:10:52] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:10:53] <_joe_> which is not "we would not notice" [14:11:01] RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [14:11:02] <_joe_> because if you intend to, you will [14:11:11] PROBLEM - Host cr2-eqiad IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ffff::2 [14:11:29] (03PS5) 10Chad: Gerrit: Greatly simplify directory management on host [puppet] - 10https://gerrit.wikimedia.org/r/300048 [14:11:31] (03PS1) 10Chad: Gerrit: Store the ssh_host_key in private puppet secrets [puppet] - 10https://gerrit.wikimedia.org/r/300279 [14:11:32] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:11:32] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:11:32] RECOVERY - puppet 
last run on elastic1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:11:42] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:11:44] (03PS2) 10Chad: Remove OATHAuth from wikitech's extension-list, it's in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300105 [14:11:51] (03CR) 10Chad: [C: 032] Remove OATHAuth from wikitech's extension-list, it's in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300105 (owner: 10Chad) [14:11:52] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:11] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:12:11] _joe_: as you may have gotten from the ticket, i don't think that's restbase failing, but rather firejail killing it [14:12:12] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:12] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:12:12] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:12:21] RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [14:12:22] RECOVERY - puppet last run on mw2089 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:23] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:24] _joe_: which is a different problem [14:12:31] RECOVERY - puppet last run on druid1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:31] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0
failures [14:12:32] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:32] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:12:33] _joe_: i do agree that the outcome for users is the same [14:12:46] <_joe_> mobrovac: I think you got it wrong, but even if it was, still explain to me why Restart=always is a bad idea [14:12:56] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:57] RECOVERY - puppet last run on mw1159 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:57] RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:58] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [14:13:06] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:13:09] <_joe_> I think your problem is that firejail translates rb crash exit codes to 0 [14:13:16] (03Merged) 10jenkins-bot: Remove OATHAuth from wikitech's extension-list, it's in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300105 (owner: 10Chad) [14:13:25] <_joe_> which makes systemd without restart=always NOT restart the service [14:13:27] (03CR) 10Paladox: [C: 031] Gerrit: Greatly simplify directory management on host [puppet] - 10https://gerrit.wikimedia.org/r/300048 (owner: 10Chad) [14:13:38] RECOVERY - puppet last run on mw1188 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:13:38] <_joe_> but I just gave a quick look [14:13:46] RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:13:47] RECOVERY - puppet last run on mc2002 is OK: OK: Puppet is currently enabled, last
run 1 minute ago with 0 failures [14:14:06] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:14:08] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 216, down: 5, dormant: 0, excluded: 0, unused: 0BRxe-5/3/0: down - Core: cr2-eqiad:xe-5/3/0 {#2651} [10Gbps DF]BRxe-4/3/0: down - Core: cr2-eqiad:xe-4/3/0 {#3456} [10Gbps DF]BRae0: down - Core: cr2-eqiad:ae0BRae0.0: down - BRxe-5/2/0: down - Core: cr2-eqiad:xe-5/2/0 {#1983} [10Gbps DF]BR [14:14:27] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:14:27] !log demon@tin Synchronized wmf-config/: extension list cleanups (duration: 00m 34s) [14:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:14:37] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:14:46] RECOVERY - puppet last run on mc2009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:14:57] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:15:07] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [14:15:20] <_joe_> mobrovac: I'm not saying we should not disable firejail, just I'd want a bit more evidence that's the issue here [14:15:27] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:15:39] (03CR) 10Paladox: [C: 031] Gerrit: Store the ssh_host_key in private puppet secrets [puppet] - 10https://gerrit.wikimedia.org/r/300279 (owner: 10Chad) [14:15:47] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:15:47] RECOVERY - puppet last run on mw2103 is 
OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:16:32] I think default is Restart=no which doesn't restart in any case [14:16:54] <_joe_> godog: nope [14:17:57] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 24 ESP OK [14:18:09] <_joe_> still, I need a good reason NOT to enable that [14:18:46] RECOVERY - Host cr2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.69 ms [14:19:24] _joe_: double check, restart=no is the systemd default [14:19:30] <_joe_> godog: yep [14:19:37] RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:53] <_joe_> we seriously want all our prod user-facing apps to restart=always [14:19:58] RECOVERY - puppet last run on mw2062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:20:17] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0 [14:20:53] anyways let's figure out what happened first and then what to do with restart behaviour [14:20:57] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: puppet fail [14:21:03] _joe_: godog: ok, historical context, when we set restart=no for RB, the problem was systemd restarting it continuously after failed starts (where RB wouldn't even be able to start up in the first place) [14:21:14] that was relevant at the time [14:21:21] because we had schema changes going on [14:21:23] <_joe_> mobrovac: there is an option to limit that [14:21:44] <_joe_> and I think we use it? [14:21:52] to limit what? [14:22:04] <_joe_> the rate at which a service will be restarted [14:22:38] also it should have been restart=on-failure, no? 
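[editor's note] The exit-code masking _joe_ suspects above can be sketched with a hypothetical wrapper standing in for firejail. This is only an illustration of the failure mode under discussion, not firejail's documented behaviour: if the supervising process reports status 0 even when the child crashes, systemd's `Restart=on-failure` never triggers, which is why `Restart=always` is being proposed.

```shell
# Hypothetical stand-in for the suspected firejail behaviour: run the
# child command, then report success regardless of how it exited.
mask() {
    "$@"        # the child may crash with a non-zero status...
    return 0    # ...but the wrapper swallows it and reports 0
}

mask false                    # 'false' exits 1, i.e. a simulated crash
echo "wrapper reported: $?"   # prints 0, so on-failure logic never fires
```

Under this assumption systemd sees a clean exit, and with the default `Restart=no` (or even `Restart=on-failure`) it leaves the service down.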
[14:22:42] RestartSec [14:22:53] <_joe_> godog: not really [14:22:56] we have it set at 2 [14:23:22] <_joe_> godog: actually, if we wanted to get fancy, we could build into service-runner a systemd notifier [14:23:36] RECOVERY - Host cr2-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 3.26 ms [14:24:53] so systemd will restart a service at max once in 2 seconds [14:25:16] 07Blocked-on-Operations, 06Operations, 10Continuous-Integration-Infrastructure, 10Zuul: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894#2483419 (10Paladox) >>! In T140894#2482428, @demon wrote: > Let's do this tomorrow morning maybe? :) [14:26:33] !log cr2-eqiad: reenabling all asw-*-eqiad interfaces [14:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:57] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 192, down: 0, dormant: 0, excluded: 0, unused: 0 [14:29:59] !log cr2-eqiad: reenabling xe-4/2/0 (link to cr1-eqord) and xe-5/2/3 (link to cr2-codfw) [14:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:31:17] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [14:32:13] godog: unrelated, but i see in rb1008 syslog java.net.UnknownHostException: graphite1003.eqiad.wmnet [14:32:20] from the metrics collector [14:32:24] from an hour ago [14:33:58] 06Operations, 06Labs, 13Patch-For-Review: Move labs graphite to graphite-labs.wikimedia.org - https://phabricator.wikimedia.org/T140899#2483436 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Done and left redirects in place! [14:34:00] <_joe_> yeah there was some dns failure around that time [14:34:06] <_joe_> mobrovac: what time exactly? 
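[editor's note] The restart policy hashed out above can be sketched as a unit-file fragment. This is an illustration only: the directive names come from systemd's documentation, `RestartSec=2` is the value mentioned in the discussion, and the `ExecStart` path and start-limit values are made up.

```ini
# Illustrative fragment of a restbase.service unit, not the deployed config.
[Service]
ExecStart=/usr/bin/nodejs /srv/restbase/server.js
Restart=always          # restart on any stop not initiated by an operator
RestartSec=2            # wait 2 seconds between restart attempts
StartLimitInterval=60   # hypothetical: together with StartLimitBurst, breaks
StartLimitBurst=5       #   an endless loop when the service cannot start at all
```

`systemctl show -p Restart restbase` shows the effective value; as noted above, systemd's default is `Restart=no`, under which a service that exits, for whatever reason, simply stays down.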
[14:34:51] _joe_: same time as the RB failure - 13:41:12 [14:35:03] <_joe_> yeah same time of the dns failures on the puppetmaster [14:35:18] <_joe_> mobrovac: I think this is strictly related to this rb crash tbh [14:35:33] this == dns failure? [14:35:36] <_joe_> yes [14:35:49] <_joe_> to the rb crash [14:35:53] !log cr2-eqiad: enabling Fundraising interface & BGP [14:35:54] Jul 21 13:40:54 re0.cr2-eqiad alarmd[2899]: Alarm cleared: CB color=RED, class=CHASSIS, reason=CB fabric links require upgrade/training [14:35:54] Jul 21 13:40:54 re0.cr2-eqiad craftd[1672]: Major alarm cleared, CB fabric links require upgrade/training [14:35:56] hm [14:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:36:07] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 108, down: 0, dormant: 0, excluded: 1, unused: 0 [14:36:17] <_joe_> I am going to take a pause [14:36:25] <_joe_> it's been a stressful 2 hours [14:37:05] !log cr2-eqiad: reenabling Transit interfaces & BGP [14:37:08] _joe_: you don't say.. [14:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:11] I think i'm more calm now than doing my regular job ;p [14:37:16] godog also did you include the icinga prometheus check in our infrastructure? 
[14:37:30] <_joe_> paravoid: ehehh [14:38:25] YuviPanda: I didn't yet, no [14:38:56] godog ok, let me know when you do :) also the graphs in admin grafana aren't showing up in readonly grafana, let me know if you have time to help investigate :) [14:39:29] !log cr2-eqiad: reenabling IX interface & BGP [14:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:40:34] !log cr2-eqiad: restoring PyBal BGP sessions [14:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:42:59] !log cr2-eqiad: restoring VRRP priorities [14:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:43:11] (03PS2) 10MarcoAurelio: Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) [14:43:58] !log cr2-eqiad is now upgraded, passing transit and cross-DC traffic and is the VRRP master in eqiad [14:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:58] (03PS1) 10Mobrovac: Revert "Change-prop: Ignore bot edits on ORES precache updates." [puppet] - 10https://gerrit.wikimedia.org/r/300282 [14:45:26] RECOVERY - Host lutetium is UP: PING OK - Packet loss = 0%, RTA = 2.16 ms [14:45:37] (03PS3) 10MarcoAurelio: Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) [14:48:26] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "Change-prop: Ignore bot edits on ORES precache updates." [puppet] - 10https://gerrit.wikimedia.org/r/300282 (owner: 10Mobrovac) [14:48:42] mobrovac: ^ [14:48:48] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:48:49] grazie godog! [14:48:52] prego :) [14:49:37] <_joe_> mobrovac: uh, what happened? 
[14:50:10] _joe_: a bug in the code of the extension sending the flag that is checked by changeprop :( [14:50:19] <_joe_> lol [14:52:10] 06Operations, 10ops-eqiad, 06DC-Ops: Remove all out of warranty unused cp10xx's from A2 - https://phabricator.wikimedia.org/T120856#2483487 (10Cmjohnson) Most all of the servers are removed...there are a few still in production dbproxy1001 dbproxy1002 dbproxy1003 scandium uranium radium [14:54:15] (03PS6) 10Giuseppe Lavagetto: puppetmaster: Apache 2.4/jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300242 [14:55:03] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster: Apache 2.4/jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/300242 (owner: 10Giuseppe Lavagetto) [14:55:26] 06Operations, 10Ops-Access-Requests: Requesting access to text caches for andyrussg - https://phabricator.wikimedia.org/T140958#2483494 (10AndyRussG) >>! In T140958#2483278, @BBlack wrote: > Yes, outside of global roots, access to any of the caches is pretty tightly restricted. It's not just based on needs, b... [14:56:39] (03PS1) 10Cmjohnson: Removing mgmt dns from cp1043/1044 decom'd t133614 [dns] - 10https://gerrit.wikimedia.org/r/300284 [14:58:32] !log stopping dbstore1002 for scheduled maintenace T119488 [14:58:33] T119488: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488 [14:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:04] anomie, ostriches, thcipriani, hashar, twentyafterfour, and aude: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160721T1500). Please do the needful. [15:00:04] yurik: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[15:00:12] (03PS1) 10Gehel: New partition scheme for relforge (elasticsearch) servers [puppet] - 10https://gerrit.wikimedia.org/r/300286 (https://phabricator.wikimedia.org/T137256) [15:00:53] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 10Wikimedia-Logstash: logstash - cron failing to optimize indices - https://phabricator.wikimedia.org/T140973#2483500 (10bd808) The crons being on all role::logstash nodes was intentional because as you say multiple invocations of th... [15:03:25] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 10Wikimedia-Logstash: logstash - cron failing to optimize indices - https://phabricator.wikimedia.org/T140973#2483501 (10Gehel) Ok, redundancy makes sense. I can delete all puppet managed crons and re-run puppet, which should cleanup... [15:04:03] (03PS1) 10Giuseppe Lavagetto: puppetmaster: fix apache vhost syntax [puppet] - 10https://gerrit.wikimedia.org/r/300287 [15:04:36] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster: fix apache vhost syntax [puppet] - 10https://gerrit.wikimedia.org/r/300287 (owner: 10Giuseppe Lavagetto) [15:07:57] (03CR) 10Gehel: "Since this change does not seem to be needed, should we drop it?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/297459 (owner: 10DCausse) [15:08:12] (03PS1) 10Giuseppe Lavagetto: puppetmaster: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/300288 [15:09:28] (03PS2) 10Giuseppe Lavagetto: puppetmaster: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/300288 [15:10:02] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/300288 (owner: 10Giuseppe Lavagetto) [15:12:41] 06Operations, 10fundraising-tech-ops, 10netops: Cleanup layer2 firewall config from pfw-eqiad - https://phabricator.wikimedia.org/T111463#2483519 (10Jgreen) [15:13:05] 06Operations, 10Gerrit, 10hardware-requests: Need spare server to upgrade/migrate gerrit - https://phabricator.wikimedia.org/T123132#2483525 (10demon) [15:13:07] RECOVERY - puppetmaster backend https on rhodium is OK: HTTP OK: Status line output matched 400 - 330 bytes in 0.067 second response time [15:13:07] 06Operations, 10Gerrit, 13Patch-For-Review: setup/deploy server lead as jessie gerrit server - https://phabricator.wikimedia.org/T126794#2483521 (10demon) 05Open>03Resolved Lead is deployed and running gerrit on Jessie. It's just not the master yet. That's T70271. [15:13:09] <_joe_> puppet failures are expected now [15:13:24] <_joe_> I am running puppet on the puppet masters, and that will reload apache [15:14:33] 06Operations, 10Gerrit, 13Patch-For-Review: Update gerrit sshkey in role::ci::slave::labs when upgrade to Jessie happens - https://phabricator.wikimedia.org/T131903#2483526 (10demon) 05Open>03Resolved a:03demon That public key won't be changing, neither will the ssh host key. I'm tentatively closing this. 
[15:15:26] 06Operations, 10fundraising-tech-ops, 10netops: Cleanup layer2 firewall config from pfw-eqiad - https://phabricator.wikimedia.org/T111463#2483531 (10Jgreen) p:05Low>03High bumping to high because this blocks adding pfw ports, which in turn blocks hardware refreshes [15:18:12] (03CR) 10Eevans: [C: 04-1] "This isn't unreasonable, but I'm -1 for committing this cluster-wide at the moment. It should be tested in a more isolated manner first, " [puppet] - 10https://gerrit.wikimedia.org/r/300100 (https://phabricator.wikimedia.org/T140825) (owner: 10GWicke) [15:20:04] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 10netops: put pfw1- ge-2/0/11 in the 'fundraising' vlan for new host frqueue1001 - https://phabricator.wikimedia.org/T140991#2483556 (10Jgreen) [15:21:04] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10MediaWiki-Search: Italian Wikivoyage search feature doesn't work with non existing pages - https://phabricator.wikimedia.org/T140982#2483574 (10Danny_B) p:05Triage>03Unbreak! Confirming. [15:21:46] _joe_, mobrovac when do you guys want to try a dummy parsoid deploy today to verify trebuchet deploys are fine? we can try any time or do it during the services window in ~90 odd mins. [15:22:06] <_joe_> subbu: we're going into a meeting in 8 minutes [15:22:18] <_joe_> so I'd say let's test it either right now [15:22:21] <_joe_> or in 40 [15:22:23] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 10netops: put pfw1- ge-2/0/11 in the 'fundraising' vlan for new host frqueue1001 - https://phabricator.wikimedia.org/T140991#2483582 (10Jgreen) [15:22:27] ok .. in 40 then. [15:32:10] (03PS2) 10Andrew Bogott: Disable instance rebuild in Horizon. 
[puppet] - 10https://gerrit.wikimedia.org/r/300077 (https://phabricator.wikimedia.org/T140259) [15:32:12] (03PS1) 10Andrew Bogott: Use special monitor-account creds for the rabbitmq collector [puppet] - 10https://gerrit.wikimedia.org/r/300293 [15:32:14] (03CR) 10RobH: [C: 031] "a few notes:" [puppet] - 10https://gerrit.wikimedia.org/r/300286 (https://phabricator.wikimedia.org/T137256) (owner: 10Gehel) [15:33:04] 06Operations, 10ops-codfw, 10netops: audit network ports in a4-codfw - https://phabricator.wikimedia.org/T140935#2483624 (10Papaul) ge-4/0/0 up up mw2239 ge-4/0/1 up up mw2240 ge-4/0/2 up up mw2241 ge-4/0/3 up up mw2242 ge-4/0/4 up up mw2243 ge-4/0/5 up up mw2244 ge-4/0/6 up up mw2245 ge-4/0/7 up up mw2246 g... [15:34:38] joal, did you just see my email [15:34:57] joal, feel free to check if there is something broken on your side [15:38:30] there are api issues with wikidata [15:38:36] (03CR) 10Andrew Bogott: [C: 032] Disable instance rebuild in Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/300077 (https://phabricator.wikimedia.org/T140259) (owner: 10Andrew Bogott) [15:39:09] "Wikibase\Repo\Store\WikiPageEntityStore::updateWatchlist: Automatic transaction with writes in progress (from DatabaseBase::query (LinkCache::addLinkObj)), performing implicit commit!" [15:39:20] It could be no issue, maybe just log noise? [15:39:27] issues* [15:39:41] (03CR) 10Andrew Bogott: [C: 032] Use special monitor-account creds for the rabbitmq collector [puppet] - 10https://gerrit.wikimedia.org/r/300293 (owner: 10Andrew Bogott) [15:41:46] (03PS1) 10ArielGlenn: clean up verbose mode print of commands to run [dumps] - 10https://gerrit.wikimedia.org/r/300294 [15:42:33] this seems to be happening since 10:10, but I do not see any deployments there [15:44:37] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [15:45:56] oh, perfectly reported already!
https://phabricator.wikimedia.org/T140955 [15:46:08] 06Operations, 10Cassandra, 06Services, 10hardware-requests: 6x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#2483658 (10mark) @RobH could you prepare quotes for this? Thanks! [15:46:16] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:46:46] jynus: :) :) [15:46:47] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:46:54] thanks, greg-g [15:46:56] (03PS3) 10MarcoAurelio: Closing wikimania2015wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298772 (https://phabricator.wikimedia.org/T139032) [15:47:13] 06Operations, 10Traffic, 06WMF-Communications, 07HTTPS, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2483662 (10Florian) [15:48:27] especially putting the error/function on the title helps with the visibility (I also do that to avoid duplicate reports) [15:48:47] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:05] jynus: ditto [15:50:08] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [15:50:16] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:51:10] 06Operations, 10Cassandra, 06Services, 10hardware-requests: 6x additional Cassandra/RESTBase nodes -
https://phabricator.wikimedia.org/T139961#2483687 (10RobH) a:03RobH [15:55:05] bblack: hey! lmk if sometime you'd like to have another go at checking for CN cookies on a cache server. Sorry for unnecessarily opening the access task... I'd especially like to see the full details of the "*-campaign (where * = 'enwiki', 'eswiki', etc.)" bit from the previous attempt... thx!!! [15:55:43] I believe full results can't be posted anywhere public due to privacy issues, so if you're OK with it, another channel could be found. Thx again :) [15:56:16] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2483717 (10CCogdill_WMF) After pushing IBM for a couple weeks, they finally sent us this response today: “After reviewing... [15:56:18] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:57:07] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [16:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160721T1600). Please do the needful. [16:00:05] hashar, urandom, and thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:12] o/ [16:00:17] AndyRussG: from the list we have so far, it seems like there's no general pattern that covers all CN cookies, right? That might be another important thing going forward: giving them all a common prefix or suffix, like "CN_" [16:01:04] jynus: bah, missed your comment as I was typing mine, sorry for being redundant [16:01:13] np [16:01:20] it happens very frequently [16:01:21] bblack: indeed.
That's what we do have going forward :) The unpredictable names are basically from in-banner JS included in community banners over time [16:02:00] ok [16:02:13] jynus: https://phabricator.wikimedia.org/T765 :) [16:02:28] nice [16:02:30] 06Operations, 03Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2483755 (10debt) 05Open>03Resolved a:03debt [16:02:52] present: o/ [16:03:13] blocked on exposing websocket ports [16:04:17] (03PS2) 10Gehel: Changed partition scheme for relforge (elasticsearch) servers [puppet] - 10https://gerrit.wikimedia.org/r/300286 (https://phabricator.wikimedia.org/T137256) [16:06:15] ok urandom first [16:07:12] (03PS3) 10Filippo Giunchedi: RESTBase Cassandra: Lower compaction throughput to 20MB/s [puppet] - 10https://gerrit.wikimedia.org/r/300056 (https://phabricator.wikimedia.org/T140825) (owner: 10Eevans) [16:07:19] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] RESTBase Cassandra: Lower compaction throughput to 20MB/s [puppet] - 10https://gerrit.wikimedia.org/r/300056 (https://phabricator.wikimedia.org/T140825) (owner: 10Eevans) [16:07:21] godog: r300056 is already applied ephemerally everywhere [16:07:29] so it just makes sure it doesn't change back on a restart [16:07:50] urandom: ah, ok thanks! 
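"Already applied ephemerally everywhere" presumably means the lower compaction throughput was set at runtime via nodetool, so the puppet change only persists it across restarts; a sketch of that runtime step (an assumption based on the conversation, run per node):

```shell
# Runtime-only change: takes effect immediately but is lost on restart,
# which is why the matching cassandra.yaml/puppet change is still needed.
nodetool setcompactionthroughput 20

# Confirm the live value.
nodetool getcompactionthroughput
```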
[16:08:15] godog: r300059 is going to require restarts, but given the issues with that, i'll probably do it selectively at first [16:08:29] it only affects streaming though, so it's not something that would bring down the cluster [16:08:45] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [16:09:57] (03CR) 10BBlack: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) (owner: 10Ema) [16:10:13] 06Operations, 10ops-eqiad, 10DBA: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2483793 (10jcrespo) p:05Triage>03Normal dbproxy1002 seems to be back up again thanks to @fgiunchedi and @Joe. I will point the DNS back to the proxy again at an appropriate window. [16:10:15] urandom: indeed, so to be sure, that means timeout: 0 across the board [16:10:23] yeah [16:10:29] which is what it was in 2.1, fwiw [16:10:42] ok! [16:10:48] (03PS2) 10Filippo Giunchedi: Disable `streaming_socket_timeout_in_ms` setting [puppet] - 10https://gerrit.wikimedia.org/r/300059 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [16:10:59] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Disable `streaming_socket_timeout_in_ms` setting [puppet] - 10https://gerrit.wikimedia.org/r/300059 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [16:11:26] godog: thanks! [16:11:32] (03CR) 10RobH: [C: 031] Changed partition scheme for relforge (elasticsearch) servers [puppet] - 10https://gerrit.wikimedia.org/r/300286 (https://phabricator.wikimedia.org/T137256) (owner: 10Gehel) [16:11:33] bblack: thx again!! [16:11:43] urandom: np! [16:12:37] <_joe_> subbu: so let's try a deploy? [16:12:42] sure.
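The r300059 change amounts to a one-line cassandra.yaml setting; a sketch of the relevant fragment ("timeout: 0 across the board", matching the 2.1 behaviour mentioned above):

```yaml
# cassandra.yaml fragment: 0 disables the per-socket streaming timeout, so
# long-running streams (bootstrap, repair, decommission) are never killed
# mid-transfer; only streaming is affected, not client traffic.
streaming_socket_timeout_in_ms: 0
```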
[16:12:51] let me get onto tin [16:13:15] (03CR) 10Thcipriani: "Puppet compiler output: https://puppet-compiler.wmflabs.org/3426/" [puppet] - 10https://gerrit.wikimedia.org/r/300175 (owner: 10Thcipriani) [16:13:39] !log starting parsoid deployment [16:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:53] thcipriani: you're up [16:14:12] godog: okie doke [16:14:27] (puppet compiler in the nick of time) [16:14:42] 06Operations, 10ops-eqiad, 10netops: Upgrade cr1/cr2-eqiad JunOS - https://phabricator.wikimedia.org/T140770#2483833 (10faidon) [16:14:49] (also, I confess, for some time we did facilitate banners doing this in a couple ways, for the purpose of helping people limit banners shown... but we didn't consider the cookie consequences. The cookies created like this, i.e., with re-purposed JS from FR-banners, and also from a briefly-deployed feature, are the ones where there are pairs with one ending in "-wait".) [16:15:12] _joe 44/45 minions completed fetch ... [16:15:15] (bblack: ^) [16:15:20] so, the 45th minion is ruthenium? [16:15:24] <_joe_> subbu: sigh, ruthenium? [16:15:30] <_joe_> I damn removed it [16:15:36] yup. ruthenium [16:15:37] ruthenium.eqiad.wmnet: [16:15:37] fetch status: None [started: 1 mins ago, last-return: None mins ago] [16:15:53] so, should i continue or abort? [16:15:56] (03PS21) 10Filippo Giunchedi: Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [16:15:58] have to remove it from the redis instance on tin to make it go away. [16:16:02] <_joe_> abort I guess [16:16:04] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [16:16:05] <_joe_> thcipriani: I did [16:16:17] _joe_, aborted. [16:16:29] oh, weird. 
I thought you just meant removed the target from the instance grains [16:16:51] (03CR) 10GWicke: "@eevans: We have been running on significantly lower trickle fsync intervals before, and only increased it as a larger interval was still " [puppet] - 10https://gerrit.wikimedia.org/r/300100 (https://phabricator.wikimedia.org/T140825) (owner: 10GWicke) [16:16:58] !log aborted (test) parsoid deployment [16:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:33] <_joe_> subbu: let me inspect this again [16:17:36] k [16:17:40] 06Operations, 10ops-eqiad, 10netops: cr1/cr2-eqiad: install new SCBs and linecards - https://phabricator.wikimedia.org/T140764#2483845 (10faidon) [16:17:42] 06Operations, 10ops-eqiad, 10netops: Replace cr1/2-eqiad PSUs/fantrays with high-capacity ones - https://phabricator.wikimedia.org/T140765#2483842 (10faidon) 05Resolved>03Open @cmjohnson, if I recall correctly, you swapped cr2's fantray with the new one but not cr1's, since they were the exact same model... [16:18:11] (03PS1) 10Chad: Last wikis to wmf.11 (group2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300298 [16:18:28] <_joe_> thcipriani: is there a written procedure of how to remove a minion from trebuchet? 
[16:18:58] IIRC it was on wikitech, involving redis [16:19:11] AndyRussG: I'm taking a 1h sample now, will report back later [16:19:26] <_joe_> godog: I removed the minion from the list in redis yesterday [16:19:27] (03PS2) 10Filippo Giunchedi: Prerequisites for logstash_checker use [puppet] - 10https://gerrit.wikimedia.org/r/300175 (owner: 10Thcipriani) [16:19:31] <_joe_> but it was back now [16:19:38] (03PS3) 10Gehel: Changed partition scheme for relforge (elasticsearch) servers [puppet] - 10https://gerrit.wikimedia.org/r/300286 (https://phabricator.wikimedia.org/T137256) [16:19:43] _joe_: https://phabricator.wikimedia.org/T132182 [16:19:51] _joe_: err https://wikitech.wikimedia.org/wiki/Trebuchet#Removing_minions_from_redis [16:19:54] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2483848 (10jcrespo) a:05jcrespo>03None I do not know why this is assigned to me, these requests should be handled by https://wikitech.wikimedia.org/wiki/O... [16:20:31] <_joe_> hashar: what i did exactly... [16:20:34] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "merging, though logstash_checker should be moved to service_checker package" [puppet] - 10https://gerrit.wikimedia.org/r/300175 (owner: 10Thcipriani) [16:20:45] 06Operations, 10ops-eqiad, 10DBA: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2483855 (10jcrespo) a:03jcrespo [16:21:05] _joe_: there was this: https://wikitech.wikimedia.org/wiki/Trebuchet#Removing_minions_from_redis but that doesn't capture the whole process, just the reporting. You'll also need to remove the grain from the instance otherwise trebuchet will try to use it again.
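The two-part cleanup thcipriani describes might look roughly like this; the redis key layout and the deploy-repo name are assumptions for illustration, not a verified procedure:

```shell
# On the deployment server: drop the stale host from the redis set that
# Trebuchet uses for deploy reporting (key name is a guess).
redis-cli SREM 'deploy:parsoid/deploy:minions' 'ruthenium.eqiad.wmnet'

# On the salt master: remove the repo from the host's deployment_target
# grain, otherwise Trebuchet will pick the minion up again on the next
# deploy (the "magically added back" behaviour seen here).
salt 'ruthenium*' grains.remove deployment_target 'parsoid/deploy'
```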
[16:21:22] * thcipriani looks up the name of the grain [16:21:26] _joe_: so I guess they are magically added back again due to a puppet deployment::target that is leftover (pure speculation) [16:22:08] (03PS1) 10BBlack: puppet @var warning on fastopen_pending_max [puppet] - 10https://gerrit.wikimedia.org/r/300302 [16:22:09] ah the deployment_target: grain on the instance will have an array of things deployed to it [16:22:50] hashar: where did you see https://gerrit.wikimedia.org/r/#/c/298568/2 failing btw? [16:22:56] failing as in, not working as expected [16:23:43] <_joe_> subbu: try now? [16:23:49] ok .. [16:23:55] (03PS1) 10Chad: Gerrit: Disable downloading of archives [puppet] - 10https://gerrit.wikimedia.org/r/300304 [16:24:03] !log starting (test) parsoid deployment [16:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:25] godog: thank you for the merges :) [16:25:05] (03CR) 10Paladox: [C: 031] "We can directly link GitHub in phabricator :)" [puppet] - 10https://gerrit.wikimedia.org/r/300304 (owner: 10Chad) [16:25:21] thcipriani: np, do you think it could be moved out of puppet anytime soon? [16:25:29] _joe_, 44/44 now .. so whatever you did worked. [16:25:36] continuing. 
[16:25:39] <_joe_> actually I stated pretty clearly I wanted that not merged [16:25:51] <_joe_> as it should've been moved to service-checker [16:26:05] <_joe_> but well, now I'll have to do another transition, it's ok though [16:26:11] (03PS4) 10Paladox: phab: only mirror refs/heads/ and ./tags/ for mwcore and ops/puppet [puppet] - 10https://gerrit.wikimedia.org/r/295011 [16:26:18] <_joe_> this isn't used in nagios, so it's simpler [16:26:31] <_joe_> we also have no tests for it [16:26:35] <_joe_> which is unfortunate [16:26:55] <_joe_> anyways, whatever, it's very late (again) and I have to go in 10 minutes [16:27:20] _joe_: heh I haven't seen your do not merge comment [16:27:34] <_joe_> godog: I think there was no comment on the patch actually [16:27:41] <_joe_> my bad [16:27:50] <_joe_> that's why i am not complaining with you :) [16:27:51] ack, it can be moved. I really want to get something in place to catch terrible deploys before they hit production very soon, hence the sudden movement [16:27:56] !log synced parsoid code; restarting parsoid on wtp1001 as a canary [16:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:02] (03PS4) 10Gehel: Changed partition scheme for relforge (elasticsearch) servers [puppet] - 10https://gerrit.wikimedia.org/r/300286 (https://phabricator.wikimedia.org/T137256) [16:28:05] ok but yeah not a huge deal [16:28:13] !log Cancelling 2003-c bootstrap, and disabling Puppet on restbase2003.codfw.wmnet to keep instance down : T134016 [16:28:14] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:28:14] wtp1001 looking good .. restarting parsoid all nodes. [16:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:27] ottomata: could I get you to take a look at https://phabricator.wikimedia.org/T140342#2480251 ? :) [16:29:32] <_joe_> subbu: ack, I am going off then [16:29:52] _joe_, thanks. 
looks good. [16:30:22] !log finished (test) deploy of parsoid sha ed2f8228 [16:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:19] (03Abandoned) 10Paladox: phab: only mirror refs/heads/ and ./tags/ for mwcore and ops/puppet [puppet] - 10https://gerrit.wikimedia.org/r/295011 (owner: 10Paladox) [16:32:12] hashar: I'd like more reviews on https://gerrit.wikimedia.org/r/#/c/276346, we can talk about https://gerrit.wikimedia.org/r/#/c/298568/ tomorrow too [16:33:05] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10MediaWiki-Search: Italian Wikivoyage search feature doesn't work with non existing pages - https://phabricator.wikimedia.org/T140982#2483898 (10Gehel) a:03EBernhardson This seems to be related to interwiki search. @EBernhardson has a patch already,... [16:33:28] (03CR) 10Ema: [C: 031] puppet @var warning on fastopen_pending_max [puppet] - 10https://gerrit.wikimedia.org/r/300302 (owner: 10BBlack) [16:33:42] (03CR) 10Gehel: [C: 032] Changed partition scheme for relforge (elasticsearch) servers [puppet] - 10https://gerrit.wikimedia.org/r/300286 (https://phabricator.wikimedia.org/T137256) (owner: 10Gehel) [16:33:52] (03PS3) 10Filippo Giunchedi: contint: APPEND unattended upgrade allowed-origins [puppet] - 10https://gerrit.wikimedia.org/r/298568 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar) [16:33:58] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] contint: APPEND unattended upgrade allowed-origins [puppet] - 10https://gerrit.wikimedia.org/r/298568 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar) [16:34:06] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10MediaWiki-Search: Italian Wikivoyage search feature doesn't work with non existing pages - https://phabricator.wikimedia.org/T140982#2483910 (10EBernhardson) [16:34:22] PROBLEM - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is CRITICAL: Connection refused [16:34:22] 06Operations, 10Ops-Access-Requests, 
06WMDE-Analytics-Engineering, 13Patch-For-Review, 15User-Addshore: Requesting sudo access to analytics-wmde user on stat1002 for Addshore - https://phabricator.wikimedia.org/T140342#2483912 (10Ottomata) Naw, this is totally fine. `analytics-wmde` is a user we created... [16:34:23] hashar: nevermind, I thought https://gerrit.wikimedia.org/r/#/c/298568/ was global not only contint, merged [16:34:30] got that ^^^^ [16:34:41] godog: sorry, I just merged a patch during your window... [16:35:04] gehel: np, 99% of cases patches are ok to puppet-merge [16:35:12] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is CRITICAL: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2016-07-23 16:34:57. [16:35:20] (03PS1) 10Giuseppe Lavagetto: puppetmaster: temporarily allow rhodium to compile all catalogs [puppet] - 10https://gerrit.wikimedia.org/r/300307 (https://phabricator.wikimedia.org/T98173) [16:35:25] godog: should I merge mine and yours together? [16:35:56] gehel: yeah go for it, can't merge separately I think [16:36:05] godog: done [16:36:52] !log T134016: Restarting Cassandra to apply new stream timeout (restbase2003-a.codfw.wmnet) [16:36:53] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:08] 06Operations, 07Puppet, 13Patch-For-Review, 05Puppet-infrastructure-modernization: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#2483918 (10Joe) So, rhodium can now successfully compile its own catalog through the puppetmaster infrastructure (a... 
[16:38:35] !log T134016: Restarting Cassandra to apply new stream timeout (restbase2003-b.codfw.wmnet) [16:38:36] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:22] PROBLEM - cassandra-c service on restbase2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [16:39:32] mine ^^^ [16:40:36] ACKNOWLEDGEMENT - cassandra-c service on restbase2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed eevans Maintenance (T134016), be back soon. - The acknowledgement expires at: 2016-07-22 16:39:58. [16:40:58] 06Operations, 10ops-eqiad, 10netops: Upgrade cr1/cr2-eqiad JunOS - https://phabricator.wikimedia.org/T140770#2483946 (10faidon) OK, today we upgraded JunOS on cr2-eqiad to 13.3R9, as well as swapped the SCBs with new ones. The JunOS upgrade all generally worked without many issues and took about ~2hrs. The... [16:41:15] !log T134016: Restarting Cassandra to apply new stream timeout (restbase200r-a.codfw.wmnet) [16:41:16] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:42:18] 06Operations, 10ops-eqiad, 10netops: cr1/cr2-eqiad: install new SCBs and linecards - https://phabricator.wikimedia.org/T140764#2483971 (10faidon) cr2's SCBs were upgraded today, which didn't go very smoothly for various reasons. T140770 has the full writeup. cr2 still doesn't have the new linecard install,... [16:42:52] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:42:53] * jynus fixes icinga check. 
/me realizes on merge conflict that someone had already sent a patch for it :-( [16:43:12] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:43:21] (03PS5) 10Ema: cache_upload VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) [16:43:36] !log T134016: Restarting Cassandra to apply new stream timeout (restbase2004-b.codfw.wmnet) [16:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:03] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:44:04] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2480587 (10Gehel) The NDA group grants access to grafana-admin and a [[ https://wikitech.wikimedia.org/wiki/LDAP_Groups | few more things ]]. If @Jonas has al... [16:44:12] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [16:44:17] 06Operations, 10Parsoid, 06Services, 15User-mobrovac: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2483976 (10ssastry) 05stalled>03Open [16:44:36] 06Operations, 06Services: Move all Node.JS services to Jessie and Node 4 - https://phabricator.wikimedia.org/T124989#2483977 (10ssastry) [16:46:09] !log T134016: Restarting Cassandra to apply new stream timeout (restbase2008-a.codfw.wmnet) [16:46:10] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:51] !log ebernhardson@tin Synchronized 
php-1.28.0-wmf.11/extensions/CirrusSearch/includes/Searcher.php: T140950: Deploy UBN fix to CirrusSearch (duration: 00m 31s) [16:46:52] T140950: Undefined property: CirrusSearch\InterwikiSearcher::$searchContext - https://phabricator.wikimedia.org/T140950 [16:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:47:38] !log T134016: Restarting Cassandra to apply new stream timeout (restbase2008-b.codfw.wmnet) [16:47:38] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:48:11] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [16:49:47] (03PS6) 10Ema: cache_upload VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) [16:50:34] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2484019 (10Gehel) p:05Triage>03Normal [16:50:51] !log T134016: Restart of codfw rack 'c' instances to apply stream socket timeout complete [16:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:10] chasemp YuviPanda andrewbogott the carbon-cache too many creates were from rabbitmq for labs, not a problem though just FYI [16:51:12] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:51:32] !log T134016: Starting bootstrap of restbase2003-c.codfw.wmnet [16:51:33] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [16:51:34] godog: I don't know what that means; is it just because I restarted it too many times in a row?
[16:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:58] (03CR) 10Krinkle: "it seems foundation: it still protocol-relative (but not wikimedia: and wmf:), is that intentional?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298696 (owner: 10Legoktm) [16:52:39] YuviPanda: to answer your question, I played around with sth like your tool dropdown in https://prometheus.wmflabs.org/grafana/dashboard/db/http-s-tcp-probes-drilldown [16:53:26] YuviPanda: so the instance name is a query to prometheus to auto-fill it based on what's there [16:53:52] RECOVERY - cassandra-c service on restbase2003 is OK: OK - cassandra-c is active [16:53:55] (03PS3) 10Yuvipanda: cold-migrate: use novaenv.sh for credentials [puppet] - 10https://gerrit.wikimedia.org/r/299602 (https://phabricator.wikimedia.org/T139272) (owner: 10Andrew Bogott) [16:54:08] (03PS4) 10Yuvipanda: cold-migrate: use novaenv.sh for credentials [puppet] - 10https://gerrit.wikimedia.org/r/299602 (https://phabricator.wikimedia.org/T139272) (owner: 10Andrew Bogott) [16:54:31] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [16:54:33] (03PS7) 10Ema: cache_upload VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) [16:54:39] (03CR) 10Yuvipanda: [C: 032 V: 032] cold-migrate: use novaenv.sh for credentials [puppet] - 10https://gerrit.wikimedia.org/r/299602 (https://phabricator.wikimedia.org/T139272) (owner: 10Andrew Bogott) [16:54:49] (03PS2) 10Yuvipanda: cold-migrate: activate/deactivate base image as needed. [puppet] - 10https://gerrit.wikimedia.org/r/299661 (https://phabricator.wikimedia.org/T139272) (owner: 10Andrew Bogott) [16:55:07] (03CR) 10Yuvipanda: [C: 032 V: 032] cold-migrate: activate/deactivate base image as needed. 
[puppet] - 10https://gerrit.wikimedia.org/r/299661 (https://phabricator.wikimedia.org/T139272) (owner: 10Andrew Bogott) [16:55:11] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [16:55:32] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160721T1700). [17:01:58] (03CR) 10Paladox: "@Krinkle hi, I didn't do the logo's since it was late and I'm not sure how to." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [17:03:41] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance, 13Patch-For-Review: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2484106 (10Dereckson) [17:04:40] jenkins/gerrit seems to be having problems. known? [17:05:28] 06Operations, 10Cassandra, 06Services, 10hardware-requests: 6x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#2448119 (10RobH) [17:05:54] gerrit gave me an error when submitting a review comment ("Code Review - Error \n Server Unavailable \n 0") [17:06:18] cscott: did it persist? [17:06:24] and jenkins jobs are failing trying to clone from gerrit [17:06:30] greg-g: yes, still won't submit [17:06:39] did apergos file that task last night? [17:06:47] greg-g gerrit is slow for me too [17:06:49] 06Operations, 10Parsoid, 06Services, 15User-mobrovac: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2484158 (10ssastry) Okay, 2 months later, we are now ready to pick this up again. @akosiaris @mobrovac .. is first week of August a good time to pick this up again? 
[17:06:55] but i think i know why [17:07:10] 07:47 < apergos> !log restarted gerrit on ytterbium, it was refusing to complete git fetches for large repos (mw core, puppet...) [17:07:20] !log cleaning leftover crons on logstash* servers - T140973 [17:07:21] T140973: logstash - cron failing to optimize indices - https://phabricator.wikimedia.org/T140973 [17:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:07:28] it happened last night and a.pergos restarted it to fix it [17:07:42] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance, 13Patch-For-Review: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2479747 (10demon) Why on earth... [17:07:48] ostriches is pushing refs/changes/ for mw-core and operations/puppet to github mirror. [17:07:54] yes, gerrit is still here but slow [17:08:08] that could be it, but then why did it happen last night at midnight pacific? [17:08:34] Not sure though. [17:08:37] midnight.. sounds like cron [17:08:43] midnight ish [17:08:48] That's not it. [17:08:48] * greg-g looks at his logs [17:08:52] And I'm not running that right now [17:08:56] Oh [17:08:56] https://integration.wikimedia.org/ci/job/mediawiki-phpunit-hhvm-composer/5154/console is a recent failure-to-clone. [17:09:20] yeah gerrit is slow alright [17:09:27] gerrit http unresponsive [17:09:31] 17:01:59 git.exc.GitCommandError: 'git remote update origin' returned with exit code 1 [17:09:36] It failed to connect. [17:09:39] (03CR) 10Paladox: "@MarcoAurelio there is no such file called that, but there is something called visualeditor-nondefault.dblist but it disables visualeditor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [17:09:42] Ugh. 
[17:09:49] 17:01:59 git.exc.GitCommandError: 'git remote update origin' returned with exit code 1 [17:09:49] mutante: it wasn't on the hour mark, both before and after, afaict [17:09:51] * ostriches puts on his workin' hat. [17:09:52] 17:01:59 stderr: 'error: RPC failed; result=22, HTTP code = 503 [17:09:55] yeah [17:09:57] oops [17:09:58] What gives. [17:10:01] see eg from last night: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/113762/console [17:10:11] is it the regular "users exhausting connections" problem? [17:10:17] btw gerrit has been bounced earlier today because of dbproxy1002, if that's relevant [17:10:30] I have never seen that happen before with gerrit. [17:10:39] Maybe someone could be attacking gerrit [17:11:22] ostriches: i am checking the mgmt console now [17:11:59] yes, the queue is full: https://wikitech.wikimedia.org/wiki/Gerrit#Tasks_management [17:12:09] should I start killing jobs? [17:13:04] No. [17:13:06] Please don't. [17:13:10] ok [17:13:15] that is why I asked [17:13:24] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:13:26] !log gerrit: killed a couple of long-running git-upload-pack's for mediawiki/core [17:13:28] i cant login. want me to reboot ytterbium? [17:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:13:34] I'm already logged in fine [17:13:37] ok [17:13:44] PROBLEM - Unmerged changes on repository puppet on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:14:00] Looks like Nikerabbit's observation this morning in #wikimedia-releng had a reason :) [17:14:14] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:14:24] Bleh. 
[17:14:33] gerrit is back [17:14:34] now [17:14:37] Works fine for me [17:14:46] Oh wait [17:14:51] the blue background is gone [17:14:57] with logo [17:15:02] !log gerrit: restarting [17:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:14] That's dumb gerrit. [17:15:25] Why you explode on a bunch of git-upload-packs? [17:15:31] Oh ha [17:15:46] Maybe it is fixed in gerrit 2.12. so hopefully this problem won't happen [17:15:47] again [17:15:49] I'm now getting 503s instead of timeouts [17:15:50] after the upgrade [17:15:57] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance, 13Patch-For-Review: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2479747 (10AlexMonk-WMF) Histor... [17:16:00] Gerrit works for me now [17:16:02] RoanKattouw: Because I just restarted it :p [17:16:03] paladox, it was restarting [17:16:06] it's back up [17:16:09] Oh, thanks [17:16:13] RoanKattouw: it takes a bit, try again in some secs [17:16:13] Yup WFM now [17:16:15] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 10Wikimedia-Logstash: logstash - cron failing to optimize indices - https://phabricator.wikimedia.org/T140973#2484247 (10Gehel) Old crons have been cleaned. Let's wait a bit to see if we have other errors before closing this. [17:16:46] I'm going to watch the task queue for awhile [17:16:51] And dig and see wtf set this off. [17:16:52] * greg-g really wanted to put "Status: Nominal" [17:16:52] maintenance mode for gerrit is being worked on, btw [17:17:14] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[17:17:15] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [17:17:24] RECOVERY - Unmerged changes on repository puppet on labcontrol1002 is OK: No changes to merge. [17:17:42] mutante: The maint_mode works, but only when we explicitly turn it on. [17:17:44] RECOVERY - Unmerged changes on repository puppet on labcontrol1001 is OK: No changes to merge. [17:17:48] When it breaks, it still breaks :) [17:18:00] 06Operations, 10Ops-Access-Requests, 10EventBus, 06Services: Allow the Services team to administer the eventbus services - https://phabricator.wikimedia.org/T141013#2484278 (10mobrovac) [17:18:57] 06Operations, 10Ops-Access-Requests, 10EventBus, 06Services: Allow the Services team to administer the eventbus services - https://phabricator.wikimedia.org/T141013#2484297 (10mobrovac) @GWicke please approve as the services team manager, @Nuria please approve as the analytics team manager (the team owning... [17:19:46] greg-g: no, it was not a thing for a task [17:20:07] ostriches: right :) [17:20:08] gerrit was broken in a really weird way... anyways a kick made it happy again [17:20:27] apergos: yeah, forgot you were just going to kick it, it just happened again (most likely same cause?) [17:20:41] is it extension-dist? [17:20:46] ostriches: [17:20:46] No [17:20:48] huh [17:21:00] then dunno [17:21:23] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [17:21:23] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [17:21:27] I'm not entirely sure what happened. Same issue that happened the other day that Antoine noticed [17:21:38] Underlying cause still unclear. [17:21:46] bummer [17:21:54] maybe we do need a task so we can collect info [17:22:00] Symptoms: a few git-upload-pack start getting stuck. [17:22:05] Others pile up. [17:22:09] Queue gets unmanageable.
[17:22:10] how can you tell they are stuck? [17:22:13] Gerrit gets wobble. [17:22:21] ostriches: do you know if there's any sort of metrics pushed associated with gerrit's jvm? [17:22:24] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [17:22:41] which is to say, how could I, not a gerrit expert, tell they are stuck? [17:22:46] apergos: Eg from `gerrit show-queue -w`: [17:22:48] 187b18f2 15:53:55.868 git-upload-pack p/mediawiki/core.git [17:22:51] ah ha [17:22:56] You'll see a few just sitting there. [17:23:01] gotcha [17:23:03] maybe we could implement a watchdog and solve it "by force" [17:23:07] And the rest don't have a start time and are just "waiting....." [17:23:27] if you create pileups/take too much time, kill [17:23:29] jynus: Doable. Or at least scriptable to do semi-automatically. [17:23:52] (if we want human involvement) [17:23:52] it doesn't even have to be a fully automatic thing [17:23:55] 06Operations, 10Parsoid, 06Services, 15User-mobrovac: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2484308 (10mobrovac) Looping @Joe and @Dzahn in too. August works for me. @Joe, @akosiaris, @Dzahn ? The plan is the following. Convert 2 to 3 machines in eqiad a... [17:24:00] it can be an icinga check [17:24:01] Yeah, something like "You see it doing X, run Y" [17:24:02] :) [17:24:03] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [17:24:15] Er, icinga alerts to X, someone look, probably run Y [17:24:16] "more than X commands queued on gerrit" [17:24:20] yes [17:25:24] godog: Re jvm stats, no. Monitoring for gerrit is pretty old/rudimentary. Could possibly reuse some of the stuff we use on Elastic. [17:25:27] For basic JVM stuff. [17:25:31] oohhh [17:25:40] good that would save a bunch of digging around in traces and such [17:25:47] we can execute any script based on an icinga alert. 
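The "scriptable to do semi-automatically" idea above could start with something like the following sketch, which flags stuck entries in `gerrit show-queue -w` output. The line shape is assumed from the snippet quoted on-channel (`187b18f2 15:53:55.868 git-upload-pack p/mediawiki/core.git`), with queued-but-not-started entries showing `waiting ....` in the start-time column; the 15-minute cutoff is an illustrative choice, not anything agreed in the discussion.

```python
import datetime

def find_stuck(queue_lines, now, max_age=datetime.timedelta(minutes=15)):
    """Return task ids of entries that started more than max_age ago.

    Assumes the `gerrit show-queue -w` line shape quoted above:
    '<task-id> <HH:MM:SS.mmm> <command> <args>'. Entries still waiting
    to start carry 'waiting ....' instead of a start time and are skipped.
    """
    stuck = []
    for line in queue_lines:
        parts = line.split()
        if len(parts) < 3 or parts[1].startswith('waiting'):
            continue
        # show-queue only prints a time of day, so attach today's date...
        started = datetime.datetime.combine(
            now.date(),
            datetime.datetime.strptime(parts[1], '%H:%M:%S.%f').time())
        # ...and treat a "future" start time as yesterday's.
        if started > now:
            started -= datetime.timedelta(days=1)
        if now - started > max_age:
            stuck.append(parts[0])
    return stuck
```

A human (or a cron job) could then feed the stuck ids to `gerrit kill <task-id>`, which matches the "You see it doing X, run Y" pattern discussed.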
possible in icinga. but we'd have to be _really_ sure that fully automatic is a good idea [17:26:03] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5099119 keys - replication_delay is 0 [17:26:08] At the /bare minimum/ let's add an icinga check for a queue longer than like 50. [17:26:11] probably a bad idea to start adopting that pattern (icinga -> fixscript) [17:26:12] ostriches: yep also I think would be useful to get jvm stats in graphite, for that I think we use jmxtrans with hadoop [17:26:15] Anything longer is definitely a problem. [17:26:22] I am all for alert + documentation unless it is too frequent (it is not) [17:26:29] (Probably could go shorter, but I'm afraid of false positives during bot actions like translatewiki) [17:26:38] !log krinkle@tin Synchronized w/static.php: allow short-lived caching of 400/500 errors (duration: 00m 24s) [17:26:38] 06Operations, 10Parsoid, 06Services, 15User-mobrovac: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2484328 (10greg) [17:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:27:18] Ok, now how to expose that metric nicely... [17:27:32] that's two getting stuck in the same workday for me [17:28:07] apergos, there was a small outage before of m2 [17:28:14] show-queue is only over gerrit ssh. Which is A) Annoying because keypairs + access, and B) Not useful at all if SSH itself is lagging. [17:28:23] Wonder if it's in the RPC api. [17:28:25] yes this was ssh only all right [17:28:32] _j oe_ figured that out [17:28:41] And of course the .war file doesn't expose it [17:28:48] So can't just use java from cli. [17:28:51] jynus: when was the m2 outage? [17:29:03] of course you can't because that would make it easy, ostriches [17:29:07] god forbid [17:29:41] Nope, no useful endpoint. 
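The "queue longer than like 50" icinga check proposed above might look like this hypothetical plugin core. It assumes the check receives the text of `gerrit show-queue -w` (fetched over SSH, which is the access problem noted in the discussion), one task per line plus a trailing summary line such as `  2 tasks`; the 25/50 thresholds mirror the numbers floated on-channel.

```python
def check_queue_length(show_queue_output, warn=25, crit=50):
    """Nagios-style (exit code, message) for gerrit's task queue length.

    `show_queue_output` is assumed to be the raw text of
    `gerrit show-queue -w`; the trailing '  N tasks' summary line
    is ignored so only real queue entries are counted.
    """
    tasks = [l for l in show_queue_output.splitlines()
             if l.strip() and not l.strip().endswith('tasks')]
    n = len(tasks)
    if n >= crit:
        return 2, 'CRITICAL - %d tasks queued on gerrit' % n
    if n >= warn:
        return 1, 'WARNING - %d tasks queued on gerrit' % n
    return 0, 'OK - %d tasks queued on gerrit' % n
```

The bot-burst caveat raised on-channel (translatewiki pushes) argues for keeping the critical threshold generous at first and tightening it once false-positive behavior is understood.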
[17:29:42] :) [17:29:43] we could just focus on "it gets slow" like the humans detected it too [17:29:54] ostriches: Hmm now it seems that grrrit-wm doesn't work any more? [17:30:06] apergos, around here: https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?panelId=17&fullscreen&from=1469107734548&to=1469110113985 [17:30:07] you can see that there are git fetches stacked up on palladium [17:30:29] I noticed that there is something that fires once a minute and it was stacking up, eventually the first ones would time out [17:30:35] but that's pretty iffy to try to grab [17:30:37] RoanKattouw: that happens on every gerrit restart. [17:30:37] it's not like 50 [17:30:51] RoanKattouw: let me kick the bot [17:30:59] Thanks [17:31:18] that correlates to nothing I know of. hmm [17:31:48] RoanKattouw: It never works after a gerrit restart. [17:31:58] I have an outstanding offer of $20 to anyone who can make it auto-restart :p [17:32:21] oh wait, jynus was that the disk wipe? or am I misremembering? [17:32:56] pod "grrrit-wm-230500525-h411u" deleted [17:32:56] apergos, the link I sent you was misleading [17:32:59] ^ kubernetes [17:33:00] that was something else [17:33:08] !log restarted grrrit-wm [17:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:33:55] ETOOMANYISSUES [17:34:30] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance, 13Patch-For-Review: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2484392 (10demon) I'd prefer ju... [17:35:13] jynus: did it make it into a ticket someplace or not worth the bother? (if it did I'll follow along) [17:35:27] apergos, dbproxy1002?
[17:35:47] yes the thing you were linking [17:35:55] but is misleading [17:36:04] gerrit was down between 12:43 and 12:58 [17:36:23] apergos, https://phabricator.wikimedia.org/T140983 [17:36:29] thanks [17:36:50] ok I remember this happening at the same time as toomanyissues [17:36:52] thanks [17:36:53] (03CR) 10Dzahn: "@mobrovac should i just merge it anytime?" [puppet] - 10https://gerrit.wikimedia.org/r/300214 (https://phabricator.wikimedia.org/T140898) (owner: 10Dzahn) [17:37:10] 06Operations, 10Parsoid, 06Services, 15User-mobrovac: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2290725 (10greg) Add that T125003 subtask, but that might be the wrong one. Basically, we need to make sure Beta Cluster is updated before. [17:37:29] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2484399 (10Pavanaja) >>! In T140898#2482774, @Dzahn wrote: > copying verbatim comment from @Glaisher on T134017#2253719 > > --- > > Could someone prov... [17:39:18] 06Operations, 10ops-eqiad, 10DBA: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2484406 (10jcrespo) I am checking times, according to logs (request numbers are too low) gerrit and OTRS were down between 12:43 and 12:58. [17:39:27] 06Operations, 10Ops-Access-Requests, 10EventBus, 06Services: Allow the Services team to administer the eventbus services - https://phabricator.wikimedia.org/T141013#2484407 (10GWicke) Approved. [17:43:19] (03CR) 10Dzahn: "translations for namespaces have been provided now on https://phabricator.wikimedia.org/T140898#2484399 should that be also included here " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [17:45:26] (03CR) 10Mobrovac: "As soon as tcy.wikipedia.org is up and kicking, yes. If that happens today, you can coordinate with Gabriel, Petr or Eric E. 
to restart RB" [puppet] - 10https://gerrit.wikimedia.org/r/300214 (https://phabricator.wikimedia.org/T140898) (owner: 10Dzahn) [17:45:41] 06Operations, 10Flow, 10MediaWiki-Redirects, 03Collab-Team-Q1-July-Sep-2016, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#2484432 (10jmatazzoni) [17:50:23] 06Operations, 10VisualEditor, 07Performance: fix puppet run on osmium (by either providing or removing chromium package on jessie) - https://phabricator.wikimedia.org/T141023#2484480 (10Dzahn) [17:52:30] ACKNOWLEDGEMENT - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn https://phabricator.wikimedia.org/T141023 [17:53:23] 06Operations, 10VisualEditor, 07Performance: fix puppet run on osmium (by either providing or removing chromium package on jessie) - https://phabricator.wikimedia.org/T141023#2484498 (10Dzahn) [17:54:07] 06Operations, 10VisualEditor, 07Performance: fix puppet run on osmium (remove or provide chromium-browser package on jessie) - https://phabricator.wikimedia.org/T141023#2484480 (10Dzahn) [17:57:06] (03CR) 10EBernhardson: [C: 031] Labs: remove wmgCirrusSearchUseCompletionSuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300185 (owner: 10MaxSem) [18:01:19] RECOVERY - Ensure legal html en.m.wp on en.m.wikipedia.org is OK: all html is present. [18:04:29] 06Operations, 10Ops-Access-Requests, 10EventBus, 06Services: Allow the Services team to administer the eventbus services - https://phabricator.wikimedia.org/T141013#2484539 (10Nuria) @mobrovac . Approved on my end. Note that analytics folks (devs, not devops) also need permits, this includes: @mforns , @Mi... 
[18:05:43] 06Operations, 03Discovery-Search-Sprint, 13Patch-For-Review: Elasticsearch index indexing slow log generates too much data - https://phabricator.wikimedia.org/T117181#2484569 (10debt) 05Open>03Resolved resolving this one (was still open but in the resolved column) [18:09:33] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 2 others: Only use newer (elastic10{16..47}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#2484587 (10debt) 05Open>03Resolved [18:11:57] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint, 13Patch-For-Review: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2484591 (10debt) [18:12:00] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, and 2 others: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2484590 (10debt) 05Open>03Resolved [18:12:52] 06Operations, 10Monitoring, 06Release-Engineering-Team: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2484595 (10ori) p:05Low>03High >>! In T140942#2483038, @Gehel wrote: > Triaging this as low priority to match T117470. No, this should definitely have a higher... 
[18:13:03] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Logstash elasticsearch mapping does not allow err.code to be a string - https://phabricator.wikimedia.org/T137400#2484612 (10debt) [18:13:05] 06Operations, 06Discovery, 10Wikimedia-Logstash, 03Discovery-Search-Sprint, and 2 others: [EPIC] Upgrade elasticsearch cluster supporting logging to 2.3 - https://phabricator.wikimedia.org/T136001#2484609 (10debt) 05Open>03Resolved a:03debt [18:15:32] (03PS1) 10Dzahn: jsbench: chromium-browser on trusty, chromium on jessie [puppet] - 10https://gerrit.wikimedia.org/r/300321 (https://phabricator.wikimedia.org/T141023) [18:17:41] (03PS2) 10Dzahn: jsbench: chromium-browser on trusty, chromium on jessie [puppet] - 10https://gerrit.wikimedia.org/r/300321 (https://phabricator.wikimedia.org/T141023) [18:19:19] (03CR) 10Ori.livneh: [C: 031] "tnx" [puppet] - 10https://gerrit.wikimedia.org/r/300321 (https://phabricator.wikimedia.org/T141023) (owner: 10Dzahn) [18:20:37] (03PS1) 10Chad: Gerrit: Further tweaks to down/maintenance mode [puppet] - 10https://gerrit.wikimedia.org/r/300323 [18:20:53] (03CR) 10Dzahn: [C: 032] jsbench: chromium-browser on trusty, chromium on jessie [puppet] - 10https://gerrit.wikimedia.org/r/300321 (https://phabricator.wikimedia.org/T141023) (owner: 10Dzahn) [18:22:24] (03CR) 10Chad: [C: 04-2] "My fear with this approach is that things will fail silently or in unexpected ways. 
Rather we should just ensure this is always available " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299996 (https://phabricator.wikimedia.org/T140889) (owner: 10Dereckson) [18:22:25] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:24:16] (03CR) 10Elukey: Move the Analytics Refinery role to scap3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [18:30:46] So block conflicts seem to be having a SQL database query problem? [18:31:10] I mean, I end up getting one when I block someone exactly at the same time as someone else [18:31:24] Just got one [18:31:27] "Function: IndexPager::buildQueryInfo (LogPager)" [18:31:36] "Error: 2013 Lost connection to MySQL server during query (10.64.32.25)" [18:32:07] (03PS3) 10Elukey: Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) [18:32:08] Oh, right, there's a phab thing for this issue. [18:32:55] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480071 (10Paladox) Could someone do the images please? We need the normal image, a 1.5x image and 2x image please. By image I mean logo please. [18:34:21] (03CR) 10Chad: "https://puppet-compiler.wmflabs.org/3432/ shows no real changes except addition of 503 directive to default apache config. I'm pretty sure" [puppet] - 10https://gerrit.wikimedia.org/r/300323 (owner: 10Chad) [18:36:32] (03CR) 10Paladox: [C: 031] "Looks all good, and we will have a better looking maintenance page too." 
[puppet] - 10https://gerrit.wikimedia.org/r/300323 (owner: 10Chad) [18:41:02] ostriches: for 299996, in PS1, I offered a more conservative approach: use require_once and explicitly not require it when run from a maintenance script [18:41:18] ostriches: https://gerrit.wikimedia.org/r/#/c/299996/1/wmf-config/wikitech.php [18:42:37] Heh, that could be one-lined into defined( 'DO_MAINTENANCE' ) || include_once( ... ) [18:42:47] Which I guess works around the error, but still doesn't solve my problem. [18:42:53] If the file should be loaded, it should always be loaded. [18:42:58] Not just because the file DNE. [18:43:05] Or we're doing maintenance. [18:43:19] I'd /rather/ it fail hard and fast than unexpectedly and quiet. [18:43:39] (03PS1) 10Ori.livneh: Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) [18:44:52] (03CR) 10jenkins-bot: [V: 04-1] Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) (owner: 10Ori.livneh) [18:46:02] omg, did not align arrows [18:47:15] RECOVERY - configured eth on relforge1001 is OK: OK - interfaces up [18:47:34] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [18:47:56] RECOVERY - dhclient process on relforge1001 is OK: PROCS OK: 0 processes with command name dhclient [18:48:05] RECOVERY - DPKG on relforge1001 is OK: All packages OK [18:48:25] RECOVERY - Check size of conntrack table on relforge1001 is OK: OK: nf_conntrack is 0 % full [18:48:44] RECOVERY - Disk space on relforge1001 is OK: DISK OK [18:49:32] (03CR) 10Ottomata: [C: 031] Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [18:52:44] RECOVERY - puppet last run on relforge1001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0
failures [18:55:20] (03PS2) 10Ori.livneh: Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) [18:57:38] ostriches: what about create a Puppet class to provision an empty /etc/mediawiki/WikitechPrivateSettings.php file, add it to deployment::server (tin, mira), mediawiki::maintenance (terbium, wasat, mw1152) roles? [18:58:41] That still doesn't solve the problem. It should be on all MW nodes. [18:59:01] An empty file just means we (once again) fail quietly because we're misconfigured. [19:00:04] ostriches: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160721T1900). Please do the needful. [19:03:59] Dereckson: And if it shouldn't be on all MW nodes, then those nodes (maintenance, deploy masters) shouldn't be able to mess with it via maintenance. [19:04:07] (which also seems wrong if tin/mira cant) [19:04:25] (03PS2) 10Chad: Last wikis to wmf.11 (group2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300298 [19:04:45] (03CR) 10Jforrester: "Do we know what the current regular rates of these are?" [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) (owner: 10Ori.livneh) [19:05:46] (03CR) 10Ori.livneh: "@Jforrester, https://graphite.wikimedia.org/S/Bf" [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) (owner: 10Ori.livneh) [19:05:56] RECOVERY - NTP on relforge1001 is OK: NTP OK: Offset -0.01760518551 secs [19:07:18] (03CR) 10Chad: [C: 032] Last wikis to wmf.11 (group2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300298 (owner: 10Chad) [19:07:41] ori: Thanks. Does that mean we'll get pages several times a day at current rates? 
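The `DO_MAINTENANCE` one-liner floated above would look roughly like this in `wmf-config/wikitech.php` (an illustrative sketch only; as noted in the discussion, this works around the missing-file error rather than fixing the underlying misconfiguration, and ostriches would rather it fail hard and fast than quietly):

```php
// Hypothetical fragment for wmf-config/wikitech.php.
// Maintenance scripts define DO_MAINTENANCE, so the private settings
// are skipped there; everywhere else the file loads as before.
// Note: include_once fails quietly if the file is absent -- exactly
// the silent-failure mode objected to on-channel.
defined( 'DO_MAINTENANCE' ) || include_once( '/etc/mediawiki/WikitechPrivateSettings.php' );
```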
[19:07:50] (03Merged) 10jenkins-bot: Last wikis to wmf.11 (group2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300298 (owner: 10Chad) [19:08:48] From eyeballing that, 02:00, 04:30 (big), 09:30 (big), 13:00 (just), 15:00 in the last 24 hours. [19:09:31] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: last wikis to wmf.11 [19:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:09:39] James_F: Yeah. I'm glad you're looking -- this could use another pair of eyes. What do you think the thresholds should be? [19:10:09] (03PS1) 10Alex Monk: [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [19:10:22] Maybe start at 25/50? [19:10:31] ori: Well, I think the thresholds are if anything too high, but… eh. Maybe 20/40 instead of 15/25? [19:10:34] 'warn' is mostly meaningless, since the threshold for alerting on irc / paging is crit [19:10:35] Or what Chad said. [19:10:48] From what I see in logstash's mw error channels, that seems low enough to trigger for Bad Stuff, but high enough to not needlessly flap (which we probably would do at first) [19:10:58] * ori nods [19:11:01] sounds good, I'll update the patch [19:11:02] Can we get IRC pings for warn as well on this one? [19:11:10] (If that's hard, never mind.) [19:11:14] ori: I'm all for lowering that in time. [19:11:19] (03CR) 10jenkins-bot: [V: 04-1] [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [19:11:19] +1 [19:11:20] less errors -> happy chad [19:11:39] And happy users -> happy James. 
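[Editor's note] The warn/crit back-and-forth above reduces to a simple state mapping. A minimal sketch using the 25/50 values under discussion; the production check is a graphite-backed Icinga check that compares the percentage of datapoints above each threshold, so this is only illustrative:

```python
def classify(rate_per_minute, warn=25, crit=50):
    """Map a per-minute exceptions+fatals rate to a Nagios-style state.

    warn/crit defaults are the 25/50 values from the discussion above;
    the real check's semantics (percent of datapoints over a window)
    differ from this point-in-time comparison.
    """
    if rate_per_minute >= crit:
        return "CRITICAL"
    if rate_per_minute >= warn:
        return "WARNING"
    return "OK"
```

As noted in the discussion, only CRITICAL pages or notifies on IRC by default, which is why the warn value is "mostly meaningless" until warn-level notifications are wired up.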
[19:11:48] yeah, but I understand the concern -- if this is too noisy it'll train people to ignore it [19:12:12] (03PS3) 10Ori.livneh: Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) [19:12:25] I'm also interested in possibly moving alerts about the MW part of the stack to a different IRC channel to the ones about the metal. [19:12:32] We should also run down more of these "failed to connect to redis" ones. [19:12:32] (03CR) 10Dereckson: "@MarcoAurelio @Paladox We reversed the VE logic: all wikis now have it, instead those in visualeditor-nondefault.dblist." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [19:12:41] Either they don't need to log or they need to be more annoying. [19:12:42] (03PS4) 10Ori.livneh: Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) [19:12:47] Right now I mostly see them as spam in logstash [19:13:00] 'Cos this channel has notifications about puppet (which mere deployers can't do anything about) and about deployments (which they can). [19:13:35] (03CR) 10Dereckson: "(excepted)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [19:13:36] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2484831 (10Paladox) >>! In T140898#2484399, @Pavanaja wrote: >>>! In T140898#2482774, @Dzahn wrote: >> copying verbatim comment from @Glaisher on T13401... [19:13:37] I come back to this channel and there are often >1k messages over night since I log off around 22:00. 
[19:13:46] we can echo the alerts on additional channels, but I would like -operations to provide a synoptic view of site reliability [19:14:00] so I'd add channels rather than move it [19:14:01] Oh, sure. It's mostly a call for ostriches and the rest of RelEng. [19:14:18] Absolutely, don't want to reduce the value of this channel to others. [19:14:26] * ostriches already reads all the things [19:14:29] * ostriches also has no life [19:15:00] if one of you +1s i'll merge it [19:15:11] (03CR) 10Jforrester: [C: 031] Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) (owner: 10Ori.livneh) [19:15:27] (03PS5) 10Ori.livneh: Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) [19:15:30] (03CR) 10Chad: [C: 031] "+1s is almost sorta (nothing like) a +2" [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) (owner: 10Ori.livneh) [19:16:03] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint, 13Patch-For-Review: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2484834 (10RobH) [19:16:05] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, and 2 others: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2484832 (10RobH) 05Resolved>03Open #debt: this is not closed, as I have not finished the decommission process. Please don't resolve this task. [19:16:55] ori: How much do you know about sms paging from icinga? [19:17:13] not much, but ask anyway [19:17:34] Can we make a group for releng that *does* SMS paging for releng? [19:17:34] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2484836 (10RobH) [19:17:38] Eh, said releng twice. 
[19:17:46] yes [19:18:21] I know I get e-mails when PHD dies (or did), but I need something more eye-catching [19:18:23] like sms :p [19:18:45] see the git log for modules/nagios_common/files/contactgroups.cfg [19:18:57] Yeah I'm in there in a couple of groups. [19:19:03] (03CR) 10Greg Grossmeier: "Right now that's (over the last 24hours) (eye-balling here):" [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) (owner: 10Ori.livneh) [19:19:05] I'm just not sure how that ties to sending me a text [19:19:17] there is one special contact group called "sms" [19:19:23] if you are in there you get paged [19:19:31] but then you get all the ops pages currently [19:19:46] Yeah, that's not what I want. I want something like sms-releng [19:19:52] (or able to trigger sms for random groups) [19:19:58] Whichever is possible [19:20:12] (03CR) 10Ori.livneh: [C: 032] Add alerting for MediaWiki exceptions and fatals [puppet] - 10https://gerrit.wikimedia.org/r/300327 (https://phabricator.wikimedia.org/T140942) (owner: 10Ori.livneh) [19:20:42] yea, we don't have that as a feature yet [19:20:44] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2484855 (10debt) Whoops, sorry! Please continue doing what needs to be done and thanks for removing the search tags. :) [19:20:49] it's a bit complicated [19:20:53] I thought we had service groups? [19:21:13] I only get (got?) Parsoid pages, not general Ops ones. [19:21:17] but apparently only one that is hooked up to SMS [19:21:34] ostriches: so you'll need an opsen to push the contacts.cfg changes to add the individual data for each sms person [19:21:40] ack, i got stuck in backlog [19:21:42] we have service groups for teams and stuff [19:21:44] sorry, outdated comment! [19:21:49] with email notification [19:21:59] Ah, but the groups are not for SMS? OK.
[19:22:03] but we need to add the SMS notification method [19:22:05] yes [19:22:10] mutante: uh, i thought we paged some folks not in ops already? [19:22:20] like services? (maybe im misrecalling) [19:22:35] oh, james asked shit im on a delay it seems. [19:22:43] * robh is just not with it today [19:24:20] https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/nagios_common/files/contactgroups.cfg shows only Opsen getting SMSes. [19:24:37] robh: do we? it's possible that we added it to an individual contact.. [19:25:10] James_F: Yeah, that's the sms group. Which I could technically add myself to, but I don't need alerts when cr1 goes flapping (for example). Nothing I can do about it. [19:25:18] yea, first we had no custom groups at all.. then we did that with email [19:25:23] ostriches: Indeed. [19:25:27] i thought we had made it more granular for individual pages [19:25:38] but i never got anything but all the pages so i could be easily mistaken. [19:25:50] how hard is it to add sms-groups? [19:25:58] seems like an obvious thing for services, no? [19:26:14] not easy enough [19:26:23] i remember we looked before [19:26:32] let's make a ticket, will also check for an old one [19:26:47] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [19:26:54] mutante: thanks, I'll subscribe :) [19:27:06] 06Operations, 10VisualEditor, 13Patch-For-Review, 07Performance: fix puppet run on osmium (remove or provide chromium-browser package on jessie) - https://phabricator.wikimedia.org/T141023#2484860 (10Jdforrester-WMF) Osmium is now fixed, so this can be closed? Thank you.
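[Editor's note] The kind of group being asked for would look roughly like this in Icinga's object configuration. This is a hypothetical sketch: the group name and members are invented, and the real definitions live in modules/nagios_common/files/contactgroups.cfg and contacts.cfg:

```cfg
define contactgroup {
        contactgroup_name       sms-releng              ; hypothetical group
        alias                   RelEng SMS paging
        members                 demon, jforrester       ; invented members
}
```

As mutante points out, the missing piece is an SMS notification method attached to such a group (and per-person contact data), not the group definition itself.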
[19:27:19] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2484861 (10RobH) [19:27:42] ori: If you want to create a "mediawiki" group for those alerts, please add me to it (even if it doesn't get SMSes). [19:27:55] !log demon@tin Synchronized wikiversions.json: because sync-wikiversions doesn't care about co-masters ugh (duration: 00m 29s) [19:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:28:56] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [19:29:11] robh: ostriches: just remembered something.. so we had this same question for ores and h.alfak [19:29:38] and what he did is get email from icinga and then forward it to a mail2sms gateway himself [19:29:53] so that results in paging without us having it implemented like that [19:30:20] ha! well, if they are a US carrier, we can put their contact email address as their sms email [19:30:23] that's a workaround as well [19:30:30] yes, that [19:30:35] actually, can do that with either but it will be a messy format [19:30:40] which may be non-ideal. [19:30:43] the notification type SMS in icinga is also just email [19:30:54] just a special type of email that turns it into an SMS [19:31:03] (03PS1) 10Yuvipanda: tools: Add check for high iowait [puppet] - 10https://gerrit.wikimedia.org/r/300334 (https://phabricator.wikimedia.org/T141017) [19:31:12] well, it's also a more terse format for short text reading [19:31:16] yes? [19:31:16] (03PS2) 10Yuvipanda: tools: Add check for high iowait [puppet] - 10https://gerrit.wikimedia.org/r/300334 (https://phabricator.wikimedia.org/T141017) [19:31:21] but otherwise same content overall [19:31:22] and depending on the provider it's just something like @txt.att.com etc [19:31:38] so it's still non-ideal due to formatting i would think.
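[Editor's note] The mail2sms workaround described above (Icinga mails a carrier's email-to-SMS gateway instead of a mailbox) can be sketched as follows. The sender address and gateway domain are invented; real carriers each use their own gateway:

```python
from email.message import EmailMessage

def sms_notification(number, host, state, output):
    """Build a terse notification addressed to an email-to-SMS gateway."""
    msg = EmailMessage()
    msg["From"] = "icinga@example.org"                   # invented sender
    msg["To"] = "%s@txt.example-carrier.com" % number    # invented gateway
    msg["Subject"] = "PROBLEM"
    # Keep the body short: gateways truncate around 160 characters,
    # which is the "messy format" concern raised above.
    msg.set_content(("%s %s: %s" % (host, state, output))[:160])
    return msg
```

The resulting message would then be handed to whatever MTA Icinga already uses for email notifications; "SMS" is just a terser email template aimed at the gateway address.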
[19:31:52] (03PS3) 10Yuvipanda: tools: Add check for high iowait [puppet] - 10https://gerrit.wikimedia.org/r/300334 (https://phabricator.wikimedia.org/T141017) [19:32:01] (03PS4) 10Yuvipanda: tools: Add check for high iowait [puppet] - 10https://gerrit.wikimedia.org/r/300334 (https://phabricator.wikimedia.org/T141017) [19:32:08] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add check for high iowait [puppet] - 10https://gerrit.wikimedia.org/r/300334 (https://phabricator.wikimedia.org/T141017) (owner: 10Yuvipanda) [19:34:18] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2484870 (10RobH) No worries, I figured removing the tags would clear it from your workboards/radar =] [19:39:04] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2484884 (10RobH) a:05RobH>03Cmjohnson [19:39:37] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2441721 (10RobH) Assigned to @cmjohnson for ssh removal/disk wipe before unracking. Once they are unracked (and added to decom tracking sheet), their mgmt dns entries can be pulled. [19:40:11] robh: that is true about formatting.. and i found the line where the format is set. we can do something there [19:40:32] like host-notify-by-sms-gateway-SERVICE [19:40:42] but on ticket is good [19:41:08] * robh is decommissioning all the things [19:41:24] 06Operations, 10Icinga: implement icinga paging for non-ops teams - https://phabricator.wikimedia.org/T141038#2484903 (10Dzahn) [19:41:29] robh: :) [19:41:35] see, we gave cmjohnson1 enough time to dig out from under a pile of new metal. now we're gonna bury him under old metal. 
[19:41:39] (03PS6) 10Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) [19:41:57] (03CR) 10jenkins-bot: [V: 04-1] Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [19:41:59] hahaha [19:43:33] let's just get that old metal off the racks so I can get rid of it all at one time [19:44:06] yeah i just pushed the one for the elastic1001-1016 for you to wipe and unrack (or remove ssds and unrack) [19:44:12] (03PS7) 10Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) [19:44:17] more things to pull, huzzah [19:46:02] (03CR) 10Jforrester: [C: 031] Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [19:49:39] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [19:53:12] (03CR) 10Dereckson: [C: 04-1] "Some images issues to fix" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [19:55:30] (03CR) 10Dereckson: "Namespaces should use space, not underscore" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [19:55:59] (03CR) 10Dereckson: "Namespaces should use underscores, not spaces." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) (owner: 10Paladox) [20:01:18] ostriches: I'm going to throw some 50 MWExceptions from eval.php on tin just to test the alert [20:01:24] k [20:01:36] James_F: Why do we have securepollglobal.dblist? 
It's basically all.dblist - (loginwiki, labswiki, labstestwiki, zerowiki) [20:02:04] ostriches: Probably it's used for the maintenance script and it was easier for you/Reedy/Roan at the time. ;-) [20:02:40] Ah, could be [20:02:54] ostriches: There are several low-value dblists it'd be nice to kill. [20:03:30] ostriches: And vice versa, it might be sensible to define "all - loginwiki - votewiki" as a list, given how often we set that in InitSettings. [20:04:01] Yeah, I'm not opposed to keeping lists around if they can use expressions [20:04:21] It's mainly: adding to all.dblist should result in expected defaults, not "you also gotta add it to foo" [20:04:27] No all - nonglobal ? [20:04:44] Yup. [20:04:54] Hence why I moved to ve-nondefault. [20:04:55] Etc. [20:05:00] chad@notsexy /a/ops/mediawiki-config/dblists (master)$ diff all.dblist securepollglobal.dblist [20:05:00] 438,439d437 [20:05:00] < labswiki [20:05:00] < labtestwiki [20:05:00] 464d461 [20:05:01] < loginwiki [20:05:01] 879d875 [20:05:02] < zerowiki [20:06:32] (03PS24) 10Gehel: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (https://phabricator.wikimedia.org/T138501) [20:07:28] Yeah, so securepollglobal.dblist is used in SecurePoll for page creation or something [20:07:35] It needs a list of db names. [20:07:44] But that list seems wrong as-is. [20:09:06] (03CR) 10Gehel: [C: 032] Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (https://phabricator.wikimedia.org/T138501) (owner: 10Gehel) [20:09:28] 06Operations, 10VisualEditor, 13Patch-For-Review, 07Performance: fix puppet run on osmium (remove or provide chromium-browser package on jessie) - https://phabricator.wikimedia.org/T141023#2484957 (10Dzahn) I wanted to check one last thing. i saw "chromium-browser" was used in a script.
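[Editor's note] The "expressions over hand-maintained copies" idea above amounts to a set difference over dblists: derive securepollglobal.dblist from all.dblist instead of keeping a second list that drifts. A sketch; the function names and file handling are hypothetical, not the actual mediawiki-config tooling:

```python
def read_dblist(path):
    """One db name per line, as in mediawiki-config's dblists/ files."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def subtract(base, excluded):
    """all.dblist minus a few wikis, preserving the base order."""
    excluded = set(excluded)
    return [db for db in base if db not in excluded]

# e.g., matching the diff pasted above:
#   securepollglobal = subtract(
#       read_dblist("all.dblist"),
#       ["labswiki", "labtestwiki", "loginwiki", "zerowiki"])
```

With this approach, adding a wiki to all.dblist gives the expected defaults automatically, which is the property asked for above.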
getting on this now [20:10:36] (03PS2) 10Alex Monk: [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [20:10:47] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [20:10:52] \o/ [20:11:00] tgr: ^ [20:11:10] (just a test) [20:11:15] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [20:12:14] (03CR) 10jenkins-bot: [V: 04-1] [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [20:14:10] (03PS8) 10Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) [20:16:56] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [20:22:07] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [20:22:35] RECOVERY - salt-minion processes on relforge1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:23:07] (03CR) 10Jforrester: Enable ShortUrl on Urdu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298344 (https://phabricator.wikimedia.org/T138507) (owner: 10Jforrester) [20:29:57] PROBLEM - MegaRAID on db1011 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [20:31:53] (03PS9) 10Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) [20:35:04] (03PS3) 10Alex Monk: [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [20:38:18] 06Operations, 10Monitoring, 
06Release-Engineering-Team, 13Patch-For-Review: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2485115 (10ori) [20:40:54] (03PS10) 10Paladox: Initial configuration for tcy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) [20:40:56] (03PS1) 10Andrew Bogott: Upgrade labtest to openstack Liberty [puppet] - 10https://gerrit.wikimedia.org/r/300347 [20:48:35] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:49:27] 06Operations, 10VisualEditor, 13Patch-For-Review, 07Performance: fix puppet run on osmium (remove or provide chromium-browser package on jessie) - https://phabricator.wikimedia.org/T141023#2485183 (10Dzahn) There is a custom upstart script to start chromium-browser that is puppetized. But that needs to be... [20:53:28] (03PS4) 10Alex Monk: [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [20:53:58] (03PS2) 10Andrew Bogott: Upgrade labtest to openstack Liberty [puppet] - 10https://gerrit.wikimedia.org/r/300347 [20:54:49] 06Operations, 10ops-eqiad: db1011 disk failure (degraded RAID) - https://phabricator.wikimedia.org/T141046#2485211 (10jcrespo) [20:55:39] ACKNOWLEDGEMENT - MegaRAID on db1011 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo https://phabricator.wikimedia.org/T141046 [20:56:01] (03CR) 10jenkins-bot: [V: 04-1] [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [20:58:11] 06Operations, 10ops-eqiad, 10DBA: dbstore1002 disk errors - https://phabricator.wikimedia.org/T140337#2485251 (10jcrespo) ``` megacli -PDRbld -ShowProg -PhysDrv'[32:6]' -a0 Rebuild Progress on Device at Enclosure 32, Slot 6 Completed 98% in 908 Minutes. 
``` [20:59:32] 06Operations, 10ops-eqiad, 10DBA, 06DC-Ops: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#2485255 (10jcrespo) 05stalled>03Resolved a:03Cmjohnson [21:00:55] (03PS1) 10Gehel: Actually create initial import script for OSM data [puppet] - 10https://gerrit.wikimedia.org/r/300410 (https://phabricator.wikimedia.org/T138501) [21:01:28] 06Operations, 10DBA, 13Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2485260 (10jcrespo) It seems dbproxy1002 was "accidentally" upgraded to jessie today: T140983 [21:03:41] (03PS5) 10Alex Monk: [WIP/POC/POS] Puppetise script to manage labs floating IP PTR records [puppet] - 10https://gerrit.wikimedia.org/r/300331 (https://phabricator.wikimedia.org/T104521) [21:05:05] 06Operations, 10ops-eqiad, 10DBA: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2485265 (10jcrespo) We need to revert https://gerrit.wikimedia.org/r/300254 once we check everything is working and have a window where it is not disruptive. [21:05:45] (03CR) 10MaxSem: [C: 031] Actually create initial import script for OSM data [puppet] - 10https://gerrit.wikimedia.org/r/300410 (https://phabricator.wikimedia.org/T138501) (owner: 10Gehel) [21:08:05] 06Operations, 10Monitoring, 06Release-Engineering-Team, 13Patch-For-Review: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2485278 (10Tgr) Thinking about this more, not sure if login/signup metrics are worth the effort. One of the strengths of Wikimedia is the stro... 
[21:08:12] 06Operations, 10DBA, 13Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2485279 (10jcrespo) [21:08:17] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2485280 (10jcrespo) [21:08:52] (03PS2) 10Reedy: Apply WMF specific SiteMatrix config in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300081 (https://phabricator.wikimedia.org/T132125) [21:08:56] (03PS3) 10Reedy: Apply WMF specific SiteMatrix config in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300081 (https://phabricator.wikimedia.org/T132125) [21:09:58] (03CR) 10Reedy: [C: 032] "Removed dependency so this can go out first (co-exists with config already in SiteMatrix no issue)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300081 (https://phabricator.wikimedia.org/T132125) (owner: 10Reedy) [21:10:41] (03CR) 10Gehel: [C: 032] Actually create initial import script for OSM data [puppet] - 10https://gerrit.wikimedia.org/r/300410 (https://phabricator.wikimedia.org/T138501) (owner: 10Gehel) [21:10:57] (03Merged) 10jenkins-bot: Apply WMF specific SiteMatrix config in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300081 (https://phabricator.wikimedia.org/T132125) (owner: 10Reedy) [21:11:54] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Moved WMF specific SiteMatrix data to CommonSettings (duration: 00m 26s) [21:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:18:41] (03PS1) 10Gehel: Maps - initial import script [puppet] - 10https://gerrit.wikimedia.org/r/300423 (https://phabricator.wikimedia.org/T138501) [21:23:51] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:26:02] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:26:12] ^ maps2001 is me, patch 
coming up... [21:26:30] (03PS1) 10Dzahn: jsbench: add systemd compat for jsbench-browser [puppet] - 10https://gerrit.wikimedia.org/r/300425 (https://phabricator.wikimedia.org/T141023) [21:26:38] (03CR) 10Gehel: [C: 032] Maps - initial import script [puppet] - 10https://gerrit.wikimedia.org/r/300423 (https://phabricator.wikimedia.org/T138501) (owner: 10Gehel) [21:27:22] (03PS2) 10Dzahn: jsbench: add systemd compat for jsbench-browser [puppet] - 10https://gerrit.wikimedia.org/r/300425 (https://phabricator.wikimedia.org/T141023) [21:28:24] 06Operations, 10Monitoring, 06Release-Engineering-Team, 13Patch-For-Review: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2485421 (10Jdforrester-WMF) >>! In T140942#2485278, @Tgr wrote: > Thinking about this more, not sure if login/signup metrics are worth the eff... [21:30:00] RECOVERY - puppet last run on maps2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:31:03] (03PS3) 10Dzahn: admin: add shell account for Jasmeet Samra [puppet] - 10https://gerrit.wikimedia.org/r/300196 (https://phabricator.wikimedia.org/T140445) [21:31:26] (03CR) 10Dzahn: [C: 032] admin: add shell account for Jasmeet Samra [puppet] - 10https://gerrit.wikimedia.org/r/300196 (https://phabricator.wikimedia.org/T140445) (owner: 10Dzahn) [21:35:46] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 3 others: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2485435 (10greg) [21:43:55] (03PS3) 10Dzahn: remove all aluminum/aluminium remnants [dns] - 10https://gerrit.wikimedia.org/r/300213 (https://phabricator.wikimedia.org/T140676) [21:48:19] (03PS3) 10Andrew Bogott: Upgrade labtest to openstack Liberty [puppet] - 10https://gerrit.wikimedia.org/r/300347 [21:49:35] (03CR) 10Dzahn: [C: 032] remove all aluminum/aluminium remnants [dns] - 10https://gerrit.wikimedia.org/r/300213 
(https://phabricator.wikimedia.org/T140676) (owner: 10Dzahn) [21:50:38] (03CR) 10Andrew Bogott: [C: 032] Upgrade labtest to openstack Liberty [puppet] - 10https://gerrit.wikimedia.org/r/300347 (owner: 10Andrew Bogott) [21:52:34] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:52:59] (03PS2) 10BBlack: puppet @var warning on fastopen_pending_max [puppet] - 10https://gerrit.wikimedia.org/r/300302 [21:53:08] (03CR) 10BBlack: [C: 032 V: 032] puppet @var warning on fastopen_pending_max [puppet] - 10https://gerrit.wikimedia.org/r/300302 (owner: 10BBlack) [21:58:59] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2485532 (10GWicke) After investigating this for a while I am now fairly certain that that the master process exit was indeed caused by a DNS resoluti... [22:01:36] !log stat1002 - puppetized git pull from "refinery_source" fails [22:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:44] RECOVERY - MegaRAID on dbstore1002 is OK: OK: optimal, 1 logical, 2 physical [22:19:46] "Please do not submit patches through LinkedIn, or at the very least submit it as an unified diff" hahah [22:21:04] ? [22:21:15] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.113:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.48.113, port=9200): Read timed out. 
(read timeout=4) [22:21:48] greg-g it is on someone's LinkedIn profile [22:21:59] who works for wikimedia foundation [22:22:08] !log Restarted kibana4 on logstash1001 for "node[18588]: segfault at 2fcb25f00009 ip 0000000000ad9846 sp 00007ffe526bbb40 error 4 in node[400000+1383000]" [22:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:25:04] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.32.137, port=9200): Read timed out. (read timeout=4) [22:28:20] greg-g: hashar had to put that on his profile. i guess they tried to send him patches that way [22:28:59] LOL [22:38:48] 06Operations, 10Analytics-Cluster: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2485674 (10Dzahn) [22:39:14] 06Operations, 10Analytics-Cluster: stat1002 - puppet fails to git pull refinery_source - https://phabricator.wikimedia.org/T141062#2485686 (10Dzahn) [22:40:30] ACKNOWLEDGEMENT - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn https://phabricator.wikimedia.org/T141062 [22:42:06] (03PS1) 10ArielGlenn: fix up base wiki handling for onallwikis [dumps] - 10https://gerrit.wikimedia.org/r/300437 [22:42:26] CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: HTTPConnectionPool(host='10.64.32.137', port=9200): Read timed out. (read timeout=4) [22:42:36] are these decom'ed elasticsearch servers? [22:42:51] because i see in backlog things like "elastic1001-1016 for you to wipe and unrack" [22:43:07] mutante: if it's 1001-1016, then yes.
checking if it is [22:43:30] http://10.64.32.137:9200/ and http://10.64.48.113:9200/_ [22:43:52] 1002, 1003 [22:43:57] mutante: thats logstash1002 actually [22:44:03] something is up with that server, looking [22:44:08] oh, right [22:44:09] thanks [22:44:16] Er, bd808 ^ [22:46:20] !log restart elasticsearch on logstash1002 [22:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:48:48] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 26, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards [22:50:05] looks like it hit a java OOM, then had some issues. logstash1001-3 have their heap set to 2G, might be worthwhile to increase it. Will have to check with bd808 on that though [22:50:26] elastic heap at 2g? [22:50:28] (03PS3) 10Dzahn: Add new user 'hjiang' for Helen Jiang [puppet] - 10https://gerrit.wikimedia.org/r/300003 (https://phabricator.wikimedia.org/T140659) (owner: 10Elukey) [22:50:30] how does it liveeeeee? [22:50:35] ebernhardson: up to you :) you touched it last [22:50:38] ostriches: 1001-3 aren't data nodes [22:50:50] and also ostriches is breaking stuff for fun I think ;) [22:50:51] Speaking of heap, I should raise gerrit's on lead. [22:50:51] they are basically just routers [22:51:13] bd808: Yeah, making a visualization. A lot to ask for a log visualization platform :p [22:51:19] the new version of es we deployed might be a bit more memory hungry than the 1.7 [22:51:30] *nod* [22:52:09] logstash1001 is showing 4G of free ram [22:52:49] maybe bump from 2G to 4G? 
[22:53:21] those nodes should really only need ES ram to do aggregations but maybe we are doing more now [22:54:15] 07Blocked-on-Operations, 06Operations, 10Parsoid, 10Salt: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#2485714 (10ggellerman) [22:54:27] yea i created a ticket to bump from 2G to 4G. The increased usage of aggregations makes sense for pushing it up [22:57:04] (03PS1) 10EBernhardson: Increase elasticsearch heap on logstash routing instances [puppet] - 10https://gerrit.wikimedia.org/r/300440 (https://phabricator.wikimedia.org/T141063) [23:00:05] RoanKattouw, ostriches, MaxSem, awight, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160721T2300). [23:00:05] Addshore, Jdlrobson, Pchelolo, James_F, MaxSem, and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:14] (03PS4) 10Dzahn: Add new user 'hjiang' for Helen Jiang [puppet] - 10https://gerrit.wikimedia.org/r/300003 (https://phabricator.wikimedia.org/T140659) (owner: 10Elukey) [23:00:18] I'm here [23:00:40] * MaxSem looks around [23:00:51] MaxSem: that looks like a volunteer! [23:01:02] here [23:01:02] Heya. [23:01:03] aaaaaaaaaaaaaaá [23:02:33] (03PS1) 10Ppchelko: Change-Prop: Definition rerender bug - don't react to revision change [puppet] - 10https://gerrit.wikimedia.org/r/300442 [23:02:59] how is it 4pm already? 
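[Editor's note] The 2G-to-4G bump being proposed comes down to a one-line heap setting on the logstash routing nodes. A hypothetical fragment; the actual change is the puppet patch referenced above (Gerrit 300440), and in Elasticsearch of that era the heap was commonly set via ES_HEAP_SIZE:

```shell
# Hypothetical /etc/default/elasticsearch fragment on logstash1001-3;
# in production this value is managed through puppet, not hand-edited.
ES_HEAP_SIZE=4g
```

Setting -Xms and -Xmx to the same value (which ES_HEAP_SIZE does) avoids heap-resize pauses, and 4g stays well within the free RAM observed on logstash1001 above.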
[23:03:02] *waves* [23:04:39] okay, sent all the extension patches to zuul [23:04:42] (03CR) 10Dzahn: [C: 032] Add new user 'hjiang' for Helen Jiang [puppet] - 10https://gerrit.wikimedia.org/r/300003 (https://phabricator.wikimedia.org/T140659) (owner: 10Elukey) [23:05:40] (03CR) 10MaxSem: [C: 032] RevisionSlider enables: dewiki, hewiki, arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298933 (https://phabricator.wikimedia.org/T140232) (owner: 10Addshore) [23:06:14] (03PS4) 10MaxSem: RevisionSlider enables: dewiki, hewiki, arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298933 (https://phabricator.wikimedia.org/T140232) (owner: 10Addshore) [23:06:22] (03CR) 10MaxSem: [C: 032] RevisionSlider enables: dewiki, hewiki, arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298933 (https://phabricator.wikimedia.org/T140232) (owner: 10Addshore) [23:06:35] did I already mention how I hate this new setting? [23:06:45] which? [23:06:55] must rebase before merging [23:07:06] (03Merged) 10jenkins-bot: RevisionSlider enables: dewiki, hewiki, arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298933 (https://phabricator.wikimedia.org/T140232) (owner: 10Addshore) [23:07:56] addshore, pulled on mw1099 [23:08:00] checking [23:08:40] looks good MaxSem [23:09:25] ostriches: heapLimit = 20g [23:09:32] is that it (and a lot ?) [23:09:49] Gah, laptop is picking an unfortunate time to reboot. [23:09:50] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/298933/ (duration: 00m 29s) [23:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:00] mutante: Yeah that's it. And nah it's not a lot :P [23:10:09] addshore, deployed [23:10:15] *checks* [23:10:34] looks good! [23:10:44] \m/ [23:11:48] thanks MaxSem ! [23:12:06] * aude waves [23:12:10] Hi. 
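Most of the traffic above is the Gerrit IRC bot; the stray `03`/`10` digits are mIRC color-code residue left behind when the control characters were stripped. A small sketch of how one of those lines could be parsed back into structured fields (the regex and field names are my own, not any real tool's):

```python
import re

# Matches bot lines like:
#   (03CR) 10MaxSem: [C: 032] Subject here [repo] - 10https://gerrit.wikimedia.org/r/298933
# The two digits after "(" and before the nick are leftover mIRC color codes.
BOT_LINE = re.compile(
    r"\(\d{2}(?P<event>PS\d+|CR|Merged)\)\s+\d{2}(?P<author>\S+):\s+"
    r"(?:\[C:\s*\d+(?P<score>\d)\]\s+)?"   # optional vote; "032" renders a +2
    r"(?P<subject>.*?)\s+\[(?P<repo>[\w-]+)\]\s+-\s+\d*(?P<url>https://\S+)"
)

def parse_bot_line(line):
    """Return a dict of event/author/score/subject/repo/url, or None."""
    m = BOT_LINE.search(line)
    return m.groupdict() if m else None

line = ("(03CR) 10MaxSem: [C: 032] RevisionSlider enables: dewiki, hewiki, arwiki "
        "[mediawiki-config] - 10https://gerrit.wikimedia.org/r/298933")
print(parse_bot_line(line))
```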
[23:12:14] * addshore wave toward aude [23:12:21] (03PS4) 10MaxSem: Wikidata description config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299615 (https://phabricator.wikimedia.org/T140600) (owner: 10Jdlrobson) [23:12:22] * aude goes to enable revisionslider on arwiki and dewiki [23:12:29] (03CR) 10MaxSem: [C: 032] Wikidata description config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299615 (https://phabricator.wikimedia.org/T140600) (owner: 10Jdlrobson) [23:12:51] aude, woo! :) [23:12:52] (03PS1) 10Andrew Bogott: Catch liberty designate.conf up to the state of the art. [puppet] - 10https://gerrit.wikimedia.org/r/300444 [23:13:10] (03Merged) 10jenkins-bot: Wikidata description config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299615 (https://phabricator.wikimedia.org/T140600) (owner: 10Jdlrobson) [23:13:37] ebernhardson: thanks for cherry-picking Fix Searcher::$searchContext visibility to wmf11 :) [23:13:43] Dereckson: np [23:14:03] jdlrobson, pulled on mw1099 [23:14:09] looking [23:15:04] (03CR) 10Andrew Bogott: [C: 032] Catch liberty designate.conf up to the state of the art. [puppet] - 10https://gerrit.wikimedia.org/r/300444 (owner: 10Andrew Bogott) [23:16:41] 06Operations, 10Ops-Access-Requests, 06Editing-Analysis, 13Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2485842 (10Neil_P._Quinn_WMF) @Dzahn, thank you! One question: as far as I can tell, the patch creates a new shell account for Helen... 
[23:19:06] !log restarting uwsgi and celery for ores in scb 1001 [23:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:19:18] looks good MaxSem [23:20:29] !log maxsem@tin Synchronized dblists/wikidatadescriptions.dblist: https://gerrit.wikimedia.org/r/#/c/299615/ (duration: 00m 24s) [23:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:09] !log restarting uwsgi and celery for ores in scb1002 [23:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:58] (03PS1) 10Dzahn: gerrit: up heap size limit from 20GB to 28GB [puppet] - 10https://gerrit.wikimedia.org/r/300446 (https://phabricator.wikimedia.org/T141064) [23:22:23] !log maxsem@tin Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/299615/ (duration: 00m 29s) [23:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:34] jdlrobson, ^ [23:22:42] MaxSem: checking once more [23:23:26] sweet. That ones done [23:24:28] (03PS2) 10MaxSem: Lazy load images+references on Russian Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299619 (https://phabricator.wikimedia.org/T140197) (owner: 10Jdlrobson) [23:24:53] (03CR) 10MaxSem: [C: 032] Lazy load images+references on Russian Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299619 (https://phabricator.wikimedia.org/T140197) (owner: 10Jdlrobson) [23:25:29] (03Merged) 10jenkins-bot: Lazy load images+references on Russian Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299619 (https://phabricator.wikimedia.org/T140197) (owner: 10Jdlrobson) [23:26:03] jdlrobson, pulled on mw1099 [23:26:09] MaxSem: checking [23:26:30] MaxSem: and verified! 
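The patch above (https://gerrit.wikimedia.org/r/300446) raises Gerrit's JVM heap cap, matching the `heapLimit = 20g` value mutante found earlier. In Gerrit that setting lives in `gerrit.config` under the `container` section, which controls the daemon's `-Xmx`. A sketch only; the surrounding layout is illustrative, not lead's actual config:

```
[container]
    # Raised from 20g to 28g; caps the Gerrit JVM heap (-Xmx).
    heapLimit = 28g
```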
[23:27:18] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/299619/ (duration: 00m 24s) [23:27:20] jdlrobson, ^ [23:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:27:50] aawesome! thanks Max :) [23:29:36] Pchelolo and ebernhardson, pulled on mw1099 [23:29:43] MaxSem: checking [23:30:18] MaxSem: mine isn't really testable, it only effects job queue [23:30:32] cheater! [23:31:43] MaxSem: tested all I could, looks ok. [23:32:44] !log maxsem@tin Synchronized php-1.28.0-wmf.11/extensions/EventBus/: https://gerrit.wikimedia.org/r/#q,300332,n,z (duration: 00m 26s) [23:32:46] Pchelolo, ^ [23:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:33:00] thank you MaxSem, I'll monitor the logs [23:33:39] MaxSem: Laptop is now "calculating" how long it'll be offline. :-( [23:34:25] !log maxsem@tin Synchronized php-1.28.0-wmf.11/extensions/CirrusSearch/: https://gerrit.wikimedia.org/r/#q,300430,n,z https://gerrit.wikimedia.org/r/#q,300436,n,z (duration: 00m 32s) [23:34:27] ebernhardson, ^ [23:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:54] MaxSem: thanks, i'll keep an eye on the logs [23:35:08] James_F, drop by [23:35:35] MaxSem: all look great, thank you [23:35:56] (03PS2) 10MaxSem: Enable ShortUrl on Urdu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298344 (https://phabricator.wikimedia.org/T138507) (owner: 10Jforrester) [23:36:11] (03CR) 10MaxSem: [C: 032] Enable ShortUrl on Urdu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298344 (https://phabricator.wikimedia.org/T138507) (owner: 10Jforrester) [23:36:42] (03PS6) 10Dzahn: Gerrit: Greatly simplify directory management on host [puppet] - 10https://gerrit.wikimedia.org/r/300048 (owner: 10Chad) [23:36:46] (03Merged) 10jenkins-bot: Enable ShortUrl on Urdu Wikipedia [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/298344 (https://phabricator.wikimedia.org/T138507) (owner: 10Jforrester) [23:37:35] !log Restarted statsv on hafnium (cc Krinkle). 'gaierror: [Errno -3] Temporary failure in name resolution' [23:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:39:42] !log created ShortUrl tables on urwiki [23:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:24] !log on tin: ran mwscript extensions/ShortUrl/populateShortUrlTable.php --wiki=urwiki [23:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:01] (03CR) 10BryanDavis: [C: 031] Increase elasticsearch heap on logstash routing instances [puppet] - 10https://gerrit.wikimedia.org/r/300440 (https://phabricator.wikimedia.org/T141063) (owner: 10EBernhardson) [23:44:50] 06Operations, 10Ops-Access-Requests, 06Editing-Analysis, 13Patch-For-Review: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2485889 (10Dzahn) No, you are right about that. Just that the creation of the user and adding it to groups has to be in separate patc... 
[23:46:06] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#q,298344,n,z (duration: 00m 24s) [23:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:46:37] (03PS1) 10Ppchelko: Change-Prop: Revert the revert - ignore bots on ORES [puppet] - 10https://gerrit.wikimedia.org/r/300450 [23:47:38] jouncebot: status [23:49:11] (03CR) 10Dzahn: [C: 032] Gerrit: Greatly simplify directory management on host [puppet] - 10https://gerrit.wikimedia.org/r/300048 (owner: 10Chad) [23:49:19] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 24s) [23:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:50:16] (03PS2) 10MaxSem: Labs: remove wgDisableAuthManager - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300183 [23:50:24] (03CR) 10MaxSem: [C: 032] Labs: remove wgDisableAuthManager - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300183 (owner: 10MaxSem) [23:50:34] (03PS2) 10MaxSem: Labs: remove wmgUseOATHAuth - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300184 [23:50:42] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseOATHAuth - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300184 (owner: 10MaxSem) [23:51:16] (03PS2) 10MaxSem: Labs: remove wmgCirrusSearchUseCompletionSuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300185 [23:51:24] (03CR) 10MaxSem: [C: 032] Labs: remove wmgCirrusSearchUseCompletionSuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300185 (owner: 10MaxSem) [23:51:26] (03Merged) 10jenkins-bot: Labs: remove wgDisableAuthManager - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300183 (owner: 10MaxSem) [23:51:31] (03Merged) 10jenkins-bot: Labs: remove wmgUseOATHAuth - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300184 (owner: 10MaxSem) [23:51:41] 
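Each `!log ... Synchronized ... (duration: 00m 24s)` line above is scap reporting a completed file sync, and the duration suffix is machine-parseable. A quick sketch for pulling it out (the pattern is my own, not part of scap):

```python
import re

# Matches the "(duration: 00m 24s)" suffix scap appends to sync log lines.
DURATION = re.compile(r"\(duration: (\d+)m (\d+)s\)")

def sync_duration_seconds(log_line):
    """Return the scap sync duration in seconds, or None if absent."""
    m = DURATION.search(log_line)
    if not m:
        return None
    minutes, seconds = map(int, m.groups())
    return minutes * 60 + seconds

line = ("!log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: "
        "https://gerrit.wikimedia.org/r/#q,298344,n,z (duration: 00m 24s)")
print(sync_duration_seconds(line))  # -> 24
```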
(03PS2) 10MaxSem: Labs: remove wmgUseUrlShortener - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300186 [23:51:51] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseUrlShortener - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300186 (owner: 10MaxSem) [23:52:30] (03PS2) 10MaxSem: Labs: remove wmgLogAuthmanagerMetrics - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300187 [23:52:37] (03CR) 10MaxSem: [C: 032] Labs: remove wmgLogAuthmanagerMetrics - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300187 (owner: 10MaxSem) [23:52:43] (03Merged) 10jenkins-bot: Labs: remove wmgCirrusSearchUseCompletionSuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300185 (owner: 10MaxSem) [23:52:45] (03PS2) 10MaxSem: Labs: remove wmgUseBounceHandler - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300188 [23:52:50] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseBounceHandler - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300188 (owner: 10MaxSem) [23:52:58] (03PS2) 10MaxSem: Labs: remove wmgUseApiFeatureUsage - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300189 [23:53:05] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseApiFeatureUsage - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300189 (owner: 10MaxSem) [23:53:14] (03Merged) 10jenkins-bot: Labs: remove wmgUseUrlShortener - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300186 (owner: 10MaxSem) [23:53:16] (03PS2) 10MaxSem: Labs: remove wgUploadThumbnailRenderMethod - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300190 [23:53:24] (03CR) 10MaxSem: [C: 032] Labs: remove wgUploadThumbnailRenderMethod - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300190 (owner: 10MaxSem) [23:53:26] swat .. 
swat [23:53:35] (03PS2) 10MaxSem: Labs: remove wgUploadThumbnailRenderMap - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300191 [23:53:44] (03CR) 10MaxSem: [C: 032] Labs: remove wgUploadThumbnailRenderMap - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300191 (owner: 10MaxSem) [23:53:46] !log deploying 2d9817b to ores in scb nodes [23:53:49] How many patches? :P [23:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:01] only ten.. I know, lame [23:54:04] (03Merged) 10jenkins-bot: Labs: remove wmgLogAuthmanagerMetrics - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300187 (owner: 10MaxSem) [23:54:10] (03Merged) 10jenkins-bot: Labs: remove wmgUseBounceHandler - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300188 (owner: 10MaxSem) [23:54:12] (03PS2) 10MaxSem: Labs: remove wmgUseEventBus - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300192 [23:54:18] (03Merged) 10jenkins-bot: Labs: remove wmgUseApiFeatureUsage - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300189 (owner: 10MaxSem) [23:54:20] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseEventBus - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300192 (owner: 10MaxSem) [23:54:30] a full cleanup would be like 50 [23:54:41] (03PS2) 10MaxSem: Labs: remove wmgUseContentTranslation - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300193 [23:54:47] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseContentTranslation - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300193 (owner: 10MaxSem) [23:54:57] (03PS2) 10MaxSem: Labs: remove wmgUseEventLogging - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300194 [23:55:05] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseEventLogging - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300194 (owner: 10MaxSem) [23:55:15] 
(03PS2) 10MaxSem: Labs: remove wmgUseCampaigns - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300195 [23:55:18] (03PS1) 10Andrew Bogott: Rename oslo.config to oslo_config [puppet] - 10https://gerrit.wikimedia.org/r/300453 [23:55:19] PROBLEM - puppet last run on ytterbium is CRITICAL: CRITICAL: puppet fail [23:55:25] (03CR) 10MaxSem: [C: 032] Labs: remove wmgUseCampaigns - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300195 (owner: 10MaxSem) [23:55:30] (03Merged) 10jenkins-bot: Labs: remove wgUploadThumbnailRenderMethod - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300190 (owner: 10MaxSem) [23:55:37] (03Merged) 10jenkins-bot: Labs: remove wgUploadThumbnailRenderMap - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300191 (owner: 10MaxSem) [23:55:42] Reedy: All the patches. [23:55:43] (03Merged) 10jenkins-bot: Labs: remove wmgUseEventBus - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300192 (owner: 10MaxSem) [23:56:16] MaxSem: I bet you know the answer to this: https://phabricator.wikimedia.org/T139552#2484944 [23:56:49] (03Merged) 10jenkins-bot: Labs: remove wmgUseContentTranslation - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300193 (owner: 10MaxSem) [23:56:54] (03Merged) 10jenkins-bot: Labs: remove wmgUseEventLogging - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300194 (owner: 10MaxSem) [23:57:59] (03Merged) 10jenkins-bot: Labs: remove wmgUseCampaigns - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300195 (owner: 10MaxSem) [23:58:09] PROBLEM - puppet last run on lead is CRITICAL: CRITICAL: puppet fail [23:58:21] waiting for integration.wikimedia.org ... [23:58:43] kaldari, yes [23:58:50] i guess it's busy with Max patche s:) [23:58:52] (03PS2) 10Andrew Bogott: Rename oslo.config to oslo_config [puppet] - 10https://gerrit.wikimedia.org/r/300453