[00:06:55] <Krinkle>	 !log mwscript deleteEqualMessages.php --wiki metawiki --delete
[00:06:59] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:21:39] <Krinkle>	 !log mwscript deleteEqualMessages.php --wiki ruwiki (T45917)
[00:21:40] <stashbot>	 T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917
[00:21:43] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:25:13] <grrrit-wm>	 (CR) Luke081515: [C: 1] Permission change at test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281318 (https://phabricator.wikimedia.org/T131037) (owner: Urbanecm)
[00:28:25] <grrrit-wm>	 (PS1) Krinkle: wmfstatic: Change message "Invalid path type" to "Invalid file type" [mediawiki-config] - https://gerrit.wikimedia.org/r/281383
[00:28:38] <grrrit-wm>	 (CR) Krinkle: [C: 2] wmfstatic: Change message "Invalid path type" to "Invalid file type" [mediawiki-config] - https://gerrit.wikimedia.org/r/281383 (owner: Krinkle)
[00:29:04] <grrrit-wm>	 (Merged) jenkins-bot: wmfstatic: Change message "Invalid path type" to "Invalid file type" [mediawiki-config] - https://gerrit.wikimedia.org/r/281383 (owner: Krinkle)
[00:29:12] <grrrit-wm>	 (CR) Krinkle: [C: 2] Permission change at test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281318 (https://phabricator.wikimedia.org/T131037) (owner: Urbanecm)
[00:29:37] <grrrit-wm>	 (Merged) jenkins-bot: Permission change at test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281318 (https://phabricator.wikimedia.org/T131037) (owner: Urbanecm)
[00:30:15] <logmsgbot>	 !log krinkle@tin Synchronized w/static.php: (no message) (duration: 00m 35s)
[00:30:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:45:27] <icinga-wm>	 PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[00:46:38] <icinga-wm>	 PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[00:47:32] <Luke|away>	 Krinkle: Did you sync the right file for https://gerrit.wikimedia.org/r/281318? Seems like it is currently not live at test2
[00:47:44] <Krinkle>	 Luke|away: I didn't sync yet
[00:47:51] <Luke|away>	 ah, ok
[00:48:08] <Luke|away>	 I just wondered, because the task is closed, but the change is not live ;)
[00:48:20] <Krinkle>	 Syncing now
[00:48:26] <icinga-wm>	 RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[00:49:07] <Krinkle>	 Luke|away: If you enable ChromeWikimediaDebug and set it to mw1017, you'll see the change live
[00:49:13] <Krinkle>	 Deploying to the main cluster now
[00:49:17] <logmsgbot>	 !log krinkle@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 29s)
[00:49:21] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:49:24] <Luke|away>	 thanks
[00:49:48] <Krinkle>	 See https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug
[00:50:08] <Krinkle>	 Most of the time the canary server is the same as the rest. But during deployment it can be different (either next version, or some other temporary experiment)
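Editor's note: the exchange above describes routing a request to the mw1017 debug/canary server before a change is synced cluster-wide. A minimal sketch of what that might look like outside the browser extension follows. The header value format (`backend=...`) and the test URL are assumptions for illustration, not taken from the log; the wikitech X-Wikimedia-Debug page linked above is the authoritative source for the syntax. The request is only built here, not sent.

```python
import urllib.request


def build_debug_request(url: str, backend: str) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying an X-Wikimedia-Debug header."""
    req = urllib.request.Request(url)
    # ASSUMPTION: the "backend=<host>" value format is illustrative only;
    # consult the wikitech documentation for the real header syntax.
    req.add_header("X-Wikimedia-Debug", f"backend={backend}")
    return req


# Hypothetical target page on test2wiki, pinned to the mw1017 debug host.
req = build_debug_request(
    "https://test2.wikipedia.org/wiki/Special:Version", "mw1017"
)
# urllib stores header names capitalized as "X-wikimedia-debug".
print(req.header_items())
```

Comparing the response from such a request with a normal one is one way to check whether a change is live on the debug host but not yet on the main cluster.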
[00:50:46] <icinga-wm>	 RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[02:19:28] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [5000000.0]
[02:25:08] <logmsgbot>	 !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.19) (duration: 11m 45s)
[02:25:13] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:33:39] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Apr  4 02:33:39 UTC 2016 (duration 8m 31s)
[02:33:44] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:58:47] <icinga-wm>	 PROBLEM - HHVM rendering on mw1113 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.002 second response time
[03:02:28] <icinga-wm>	 RECOVERY - HHVM rendering on mw1113 is OK: HTTP OK: HTTP/1.1 200 OK - 73081 bytes in 0.115 second response time
[03:05:38] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[03:06:07] <icinga-wm>	 PROBLEM - Host iridium is DOWN: PING CRITICAL - Packet loss = 100%
[03:09:37] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[03:09:47] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[03:10:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[03:11:29] <matt_flaschen>	 YuviPanda, Phabricator is 503'ing for me.
[03:15:12] <greg-g>	 matt_flaschen: 03:06 < icinga-wm> PROBLEM - Host iridium is DOWN: PING CRITICAL - Packet loss = 100%
[03:15:17] <greg-g>	 so, yeah :/
[03:15:48] <greg-g>	 !log Phabricator is down
[03:15:53] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:36:52] <Josve05a>	 :/
[03:38:29] <Amir1>	 :(
[03:40:00] <Krenair>	 do ops get paged when hosts go down? I don't know how icinga is set up for this sort of thing
[03:47:57] <icinga-wm>	 PROBLEM - Disk space on restbase2004 is CRITICAL: DISK CRITICAL - free space: /srv 177759 MB (3% inode=99%)
[04:02:12] <TimStarling>	 depends on the service
[04:08:38] <TimStarling>	 it says it's powered off
[04:11:44] <TimStarling>	 http://paste.tstarling.com/p/qbrBRb.html
[04:12:42] <Krenair>	 the system powered itself off?
[04:12:46] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0]
[04:12:46] <TimStarling>	 apparently
[04:12:51] <TimStarling>	 I guess I will try turning it on
[04:13:17] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0]
[04:13:44] <TimStarling>	 !log attempting to turn iridium back on via drac. "getraclog" says it powered itself off after resetting four times
[04:13:49] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:16:00] <TimStarling>	 I'm on the serial console now, it is booting
[04:16:26] <icinga-wm>	 RECOVERY - Host iridium is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[04:16:50] <TimStarling>	 well, phab is back up
[04:18:47] <TimStarling>	 !log iridium came back up, but mcelog reports high CPU temperature prior to the shutdown
[04:18:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:19:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy
[04:22:06] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[04:22:27] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[04:23:16] <icinga-wm>	 PROBLEM - Disk space on restbase2004 is CRITICAL: DISK CRITICAL - free space: /srv 176743 MB (3% inode=99%)
[04:30:17] <icinga-wm>	 PROBLEM - Disk space on restbase2004 is CRITICAL: DISK CRITICAL - free space: /srv 176537 MB (3% inode=99%)
[04:52:27] <wikibugs>	 Operations, ops-eqiad, Phabricator: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2176955 (Peachey88)
[04:58:57] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[05:02:07] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[05:19:17] <icinga-wm>	 PROBLEM - Disk space on restbase2004 is CRITICAL: DISK CRITICAL - free space: /srv 174398 MB (3% inode=99%)
[05:31:43] <wikibugs>	 Operations, Revision-Scoring-As-A-Service, Patch-For-Review: uwsgi takes a long time to restart (Debian Jessie in labs) - https://phabricator.wikimedia.org/T118495#2176972 (Ladsgroup)
[05:33:01] <wikibugs>	 Operations, Revision-Scoring-As-A-Service, Patch-For-Review: uwsgi takes a long time to restart (Debian Jessie in labs) - https://phabricator.wikimedia.org/T118495#1801807 (Ladsgroup) a:Ladsgroup
[05:37:10] <grrrit-wm>	 (PS2) Ladsgroup: ores: do git clone in staging [puppet] - https://gerrit.wikimedia.org/r/281228
[05:46:07] <grrrit-wm>	 (PS44) Ladsgroup: ores: Scap3 deployment configurations [puppet] - https://gerrit.wikimedia.org/r/280403 (https://phabricator.wikimedia.org/T130404)
[06:34:17] <icinga-wm>	 PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:07] <icinga-wm>	 PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:54:03] <moritzm>	 !log installing apt bugfix updates
[06:54:08] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:56:57] <icinga-wm>	 RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:57:56] <icinga-wm>	 RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:07] <icinga-wm>	 PROBLEM - Disk space on restbase2004 is CRITICAL: DISK CRITICAL - free space: /srv 175589 MB (3% inode=99%)
[07:21:34] <wikibugs>	 Operations, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, Services, and 2 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2177064 (KartikMistry)
[07:44:37] <icinga-wm>	 PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: puppet fail
[07:48:27] <wikibugs>	 Operations: Integrate jessie 8.4 point release - https://phabricator.wikimedia.org/T131746#2177080 (MoritzMuehlenhoff)
[07:48:51] <grrrit-wm>	 (Abandoned) Reedy: Add Newsletter extension to beta [mediawiki-config] - https://gerrit.wikimedia.org/r/281249 (https://phabricator.wikimedia.org/T127297) (owner: 01tonythomas)
[07:49:51] <wikibugs>	 Operations, Incident-20150205-SiteOutage, Availability, Patch-For-Review: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#2177082 (Joe) p:High>Normal a:Joe>None
[07:55:53] <grrrit-wm>	 (CR) Mxn: [C: 1] "Sorry for the delay. Looks good." [mediawiki-config] - https://gerrit.wikimedia.org/r/280456 (https://phabricator.wikimedia.org/T130514) (owner: Thcipriani)
[08:12:47] <icinga-wm>	 RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:25:33] <wikibugs>	 Operations, Performance-Team, Traffic, Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2177093 (ema) Mostly out of curiosity, I've checked which protocols are supported by other top-10 websites by looking at NPN responses:   | google.com / youtube.com | h2, spdy/3.1, htt...
[08:35:00] <grrrit-wm>	 (PS1) Urbanecm: Enable Translate extension on uawikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731)
[08:38:54] <wikibugs>	 Operations, Incident-20150205-SiteOutage, Availability, Patch-For-Review: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#2177112 (Joe) I release this ticket as:  # I don't have ideas/time to work on it # This is not happenin...
[08:39:56] <wikibugs>	 Operations, Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#2177114 (Joe)
[08:39:58] <wikibugs>	 Operations, Puppet, Salt, Patch-For-Review: Make it possible for wmf-reimage to work seamlessly with a non-local salt master - https://phabricator.wikimedia.org/T124761#2177113 (Joe) Open>Resolved
[08:47:25] <grrrit-wm>	 (CR) Ema: [C: 2 V: 2] "These changes have been running in prod on the maps cluster for a few days without issues." [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280612 (owner: BBlack)
[08:49:53] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2177126 (Joe)
[08:50:04] <grrrit-wm>	 (CR) Ema: [C: 2 V: 2] Varnish 4 API porting. [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280198 (https://phabricator.wikimedia.org/T124278) (owner: Elukey)
[08:50:27] <grrrit-wm>	 (CR) Ema: [C: 2 V: 2] Remove loglines cache to mitigate a possible memory leak. [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: Elukey)
[08:50:46] <grrrit-wm>	 (CR) Ema: [C: 2 V: 2] Code cleanup [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280294 (owner: BBlack)
[08:50:46] <moritzm>	 !log installing gnupg updates
[08:50:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:51:02] <grrrit-wm>	 (CR) Ema: [C: 2 V: 2] Remove format.key feature [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280295 (owner: BBlack)
[08:51:17] <grrrit-wm>	 (CR) Ema: [C: 2 V: 2] remove a couple of inline attrs [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280377 (owner: BBlack)
[08:51:24] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2177141 (Joe)
[08:51:33] <grrrit-wm>	 (CR) Ema: [C: 2 V: 2] split lp->match allocation from lp itself [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280378 (owner: BBlack)
[08:51:49] <grrrit-wm>	 (CR) Ema: [C: 2 V: 2] remove lp->tmpbuf, bump scratch default to 4MB [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280379 (owner: BBlack)
[08:52:10] <grrrit-wm>	 (CR) Ema: [C: 2 V: 2] refactor match_assign/scratch/parser stuff [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280380 (owner: BBlack)
[08:52:32] <grrrit-wm>	 (CR) Ema: [C: 2 V: 2] minor cleanups from cppcheck [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280381 (owner: BBlack)
[08:56:55] <wikibugs>	 Operations, Puppet, Wikimedia-Apache-configuration: Refactor the mediawiki puppet classes to make HHVM default, drop zend compatibility - https://phabricator.wikimedia.org/T126310#2177147 (Joe)
[08:56:57] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2177146 (Joe)
[08:57:30] <wikibugs>	 Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2177149 (Joe)
[08:57:32] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2177126 (Joe)
[09:00:19] <grrrit-wm>	 (PS6) Volans: DB: Expose Puppet SSL certs and generate CA cert [puppet] - https://gerrit.wikimedia.org/r/279596 (https://phabricator.wikimedia.org/T111654)
[09:00:46] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, HHVM: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2177150 (Joe)
[09:02:14] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2177126 (Joe)
[09:02:16] <wikibugs>	 Operations, HHVM: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2177164 (Joe)
[09:07:44] <wikibugs>	 Operations, Puppet, MediaWiki-General-or-Unknown: Profile and reduce the puppet execution time on the appservers - https://phabricator.wikimedia.org/T131750#2177166 (Joe)
[09:07:51] <wikibugs>	 Operations, Puppet, MediaWiki-General-or-Unknown: Profile and reduce the puppet execution time on the appservers - https://phabricator.wikimedia.org/T131750#2177166 (Joe) p:Triage>Normal
[09:08:01] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2177126 (Joe) p:Triage>Normal
[09:10:23] <volans>	 !log Disabling Puppet on cluster mysql and parsercache to merge and test change 279596 on db2040, T111654
[09:10:24] <stashbot>	 T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654
[09:10:27] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:13:51] <grrrit-wm>	 (CR) Volans: [C: 2] DB: Expose Puppet SSL certs and generate CA cert [puppet] - https://gerrit.wikimedia.org/r/279596 (https://phabricator.wikimedia.org/T111654) (owner: Volans)
[09:23:39] <wikibugs>	 Operations, Salt: take steps outlined at techops offiste to (try to) address salt reliability - https://phabricator.wikimedia.org/T115292#2177187 (ArielGlenn)
[09:23:41] <wikibugs>	 Blocked-on-Operations, Operations, Parsoid, Salt, Scrum-of-Scrums: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#2177188 (ArielGlenn)
[09:23:43] <wikibugs>	 Operations, Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#2177185 (ArielGlenn) Open>Resolved And it's done, even with the blocking task listed about parsoid. They have a workaround they have been using for months.  Thanks Joe for fixing up wmf...
[09:24:04] <grrrit-wm>	 (PS4) Gehel: Don't create new log files for cirrus-suggest with logrotate [puppet] - https://gerrit.wikimedia.org/r/268215 (owner: EBernhardson)
[09:25:42] <grrrit-wm>	 (CR) Gehel: "I rebased that change as half of it was already on the production branch. It now becomes a simple one line change (not that it was overly " [puppet] - https://gerrit.wikimedia.org/r/268215 (owner: EBernhardson)
[09:26:51] <volans>	 !log Re-enabling Puppet on cluster mysql and parsercache to deploy change 279596, T111654
[09:26:52] <stashbot>	 T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654
[09:26:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:33:18] <godog>	 !log deploy restbase ba39d2bcd2f5 to restbase2004 before repooling
[09:33:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:37:23] <godog>	 !log repool restbase2004
[09:37:27] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:39:22] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: mediawiki: make base class trusty and forward only [puppet] - https://gerrit.wikimedia.org/r/281407 (https://phabricator.wikimedia.org/T126310)
[09:39:24] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: mediawiki::php: drop precise compatibility [puppet] - https://gerrit.wikimedia.org/r/281408 (https://phabricator.wikimedia.org/T126310)
[09:39:26] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: mediawiki::packages: drop precise compatibility [puppet] - https://gerrit.wikimedia.org/r/281409 (https://phabricator.wikimedia.org/T126310)
[09:39:30] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: mediawiki::web: drop precise support [puppet] - https://gerrit.wikimedia.org/r/281410 (https://phabricator.wikimedia.org/T126310)
[09:39:33] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: mediawiki::web: drop compatibility with precise, mod_php [puppet] - https://gerrit.wikimedia.org/r/281411 (https://phabricator.wikimedia.org/T126310)
[09:39:35] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: mediawiki::jobrunner: drop precise compatibility [puppet] - https://gerrit.wikimedia.org/r/281412 (https://phabricator.wikimedia.org/T126310)
[09:39:52] * _joe_ spring cleaning
[09:41:41] <godog>	 !log depool restbase2003 before raid expansion
[09:41:43] <grrrit-wm>	 (PS1) Elukey: Add info to the varnishkafka README after the Varnish 4 porting. [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/281413 (https://phabricator.wikimedia.org/T124278)
[09:41:45] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:53:03] <elukey>	 !log de-pooled aqs1001.eqiad.wmnet as pre-step for nodejs upgrade
[09:53:07] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:53:39] <joal>	 elukey: never mind :)
[09:54:40] <ema>	 !log nginx rolling restart for openssl upgrade: cp1046, cp1052, cp1068, cp1071, cp1099
[09:54:44] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:57:26] <wikibugs>	 Blocked-on-Operations, Operations, RESTBase-Cassandra: expand raid0 in restbase200[1-6] - https://phabricator.wikimedia.org/T127951#2177214 (fgiunchedi) @eevans, yes, restbase2003 is expanding its raid0 ATM
[09:57:31] <godog>	 !log start expanding raid0 on restbase2003
[09:57:35] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:53] <wikibugs>	 Operations, Gerrit, Mail, Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2177215 (hashar) p:High>Normal
[09:59:47] <moritzm>	 !log installing pcre3 updates
[09:59:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:00:04] <wikibugs>	 Operations, Gerrit, Mail, Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2158668 (hashar) Open>stalled I have updated the task detail to: * point to @akosiaris comment explaining how to resume the queue processing. * Mention 2.8.4 will fix it  This...
[10:01:35] <grrrit-wm>	 (PS1) ArielGlenn: add scap config file for dumps [puppet] - https://gerrit.wikimedia.org/r/281416
[10:02:08] <grrrit-wm>	 (PS2) ArielGlenn: set up for addition of scap config file on deployment servers for dumps [puppet] - https://gerrit.wikimedia.org/r/281416
[10:03:08] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] set up for addition of scap config file on deployment servers for dumps [puppet] - https://gerrit.wikimedia.org/r/281416 (owner: ArielGlenn)
[10:07:38] <grrrit-wm>	 (CR) Hashar: "I got rid of grunt-cli local install on all slaves a few minutes ago." [puppet] - https://gerrit.wikimedia.org/r/280974 (https://phabricator.wikimedia.org/T124474) (owner: Hashar)
[10:08:47] <icinga-wm>	 PROBLEM - Disk space on restbase2004 is CRITICAL: DISK CRITICAL - free space: /srv 177995 MB (3% inode=99%)
[10:11:29] <godog>	 taking a look ^
[10:13:26] <grrrit-wm>	 (PS2) Urbanecm: Enable Translate extension on uawikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731)
[10:14:38] <grrrit-wm>	 (PS3) ArielGlenn: set up for addition of scap config file on deployment servers for dumps [puppet] - https://gerrit.wikimedia.org/r/281416
[10:15:52] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] set up for addition of scap config file on deployment servers for dumps [puppet] - https://gerrit.wikimedia.org/r/281416 (owner: ArielGlenn)
[10:17:14] <grrrit-wm>	 (PS4) ArielGlenn: set up for addition of scap config file on deployment servers for dumps [puppet] - https://gerrit.wikimedia.org/r/281416
[10:18:38] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] set up for addition of scap config file on deployment servers for dumps [puppet] - https://gerrit.wikimedia.org/r/281416 (owner: ArielGlenn)
[10:19:26] <icinga-wm>	 RECOVERY - Disk space on restbase2004 is OK: DISK OK
[10:19:49] <grrrit-wm>	 (PS5) ArielGlenn: set up for addition of scap config file on deployment servers for dumps [puppet] - https://gerrit.wikimedia.org/r/281416
[10:20:18] <grrrit-wm>	 (PS2) Elukey: Add info to the varnishkafka README after the Varnish 4 porting. [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/281413 (https://phabricator.wikimedia.org/T124278)
[10:21:14] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] set up for addition of scap config file on deployment servers for dumps [puppet] - https://gerrit.wikimedia.org/r/281416 (owner: ArielGlenn)
[10:22:48] <grrrit-wm>	 (PS6) ArielGlenn: set up for addition of scap config file on deployment servers for dumps [puppet] - https://gerrit.wikimedia.org/r/281416
[10:23:01] <godog>	 !log reduce reserved blocks for /srv on restbase2004
[10:23:06] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:23:49] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] set up for addition of scap config file on deployment servers for dumps [puppet] - https://gerrit.wikimedia.org/r/281416 (owner: ArielGlenn)
[10:26:04] <grrrit-wm>	 (PS7) ArielGlenn: set up for addition of scap config file on deployment servers for dumps [puppet] - https://gerrit.wikimedia.org/r/281416
[10:26:07] <apergos>	 mondays
[10:26:08] <apergos>	 hate em
[10:28:39] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] set up for addition of scap config file on deployment servers for dumps [puppet] - https://gerrit.wikimedia.org/r/281416 (owner: ArielGlenn)
[10:35:59] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: mediawiki::web: drop the HHVM define and mod_php [puppet] - https://gerrit.wikimedia.org/r/281418 (https://phabricator.wikimedia.org/T126310)
[10:37:18] <_joe_>	 anyone up for reviewing ^^?
[10:37:19] <_joe_>	 :P
[10:38:10] <grrrit-wm>	 (CR) Alex Monk: "does not match existing code style" [mediawiki-config] - https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: Urbanecm)
[10:38:21] <grrrit-wm>	 (CR) Elukey: [C: 2] Add info to the varnishkafka README after the Varnish 4 porting. [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/281413 (https://phabricator.wikimedia.org/T124278) (owner: Elukey)
[10:38:31] <grrrit-wm>	 (CR) Elukey: [V: 2] Add info to the varnishkafka README after the Varnish 4 porting. [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/281413 (https://phabricator.wikimedia.org/T124278) (owner: Elukey)
[10:42:26] <elukey>	 !log re-pooled aqs1001.eqiad (no node upgrade, need more info about restbase)
[10:42:31] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:47:10] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: mediawiki::web: drop apache 2.2 support [puppet] - https://gerrit.wikimedia.org/r/281419 (https://phabricator.wikimedia.org/T126310)
[10:48:22] <wikibugs>	 Operations, Traffic, Patch-For-Review, Varnish: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2177315 (elukey)
[10:48:24] <wikibugs>	 Operations, Analytics-Kanban, Traffic, Patch-For-Review: varnishkafka integration with Varnish 4  for analytics - https://phabricator.wikimedia.org/T124278#2177313 (elukey) Open>Resolved Code merged by ema, plus the varnish maps cluster has been running with vk for days without triggering any...
[10:48:58] <icinga-wm>	 PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:49:26] <icinga-wm>	 PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:54:18] <icinga-wm>	 RECOVERY - DPKG on labmon1001 is OK: All packages OK
[11:08:25] <kart_>	 apergos: hi. r: https://phabricator.wikimedia.org/T127793 script for dump is available in ContentTranslation and default parameters are fine as of now.
[11:08:36] <kart_>	 apergos: let me know any more info we need.
[11:08:40] <apergos>	 thank you.
[11:08:50] <apergos>	 nag me in two days please if you haven't seen anything on the ticket.
[11:09:13] <kart_>	 apergos: we can 'split-at' if dump goes larger.
[11:09:22] <kart_>	 apergos: sure. thanks.
[11:09:55] <kart_>	 apergos: just need to finalize frequency. we can start with weekly, wait for dump size and decide.
[11:15:41] <grrrit-wm>	 (PS3) Urbanecm: Enable Translate extension on uawikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731)
[11:16:27] <icinga-wm>	 RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[11:18:01] <grrrit-wm>	 (CR) Urbanecm: "@Alex: What code style do you mean? I tried to do my best. If you meant the indent, I fixed it in patch 3. Sorry for it." [mediawiki-config] - https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: Urbanecm)
[11:21:02] <grrrit-wm>	 (CR) Alex Monk: "The rest of the file has spaces around the parameters to array() and spaces after the //. These additions don't" [mediawiki-config] - https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: Urbanecm)
[11:29:23] <grrrit-wm>	 (PS2) Filippo Giunchedi: graphite: add 'big_users' route and cluster [puppet] - https://gerrit.wikimedia.org/r/277490 (https://phabricator.wikimedia.org/T85451)
[11:30:16] <grrrit-wm>	 (PS4) Urbanecm: Enable Translate extension on uawikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731)
[11:30:31] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] graphite: add 'big_users' route and cluster [puppet] - https://gerrit.wikimedia.org/r/277490 (https://phabricator.wikimedia.org/T85451) (owner: Filippo Giunchedi)
[11:30:47] <grrrit-wm>	 (CR) Urbanecm: "I fixed it. Is it ok?" [mediawiki-config] - https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: Urbanecm)
[12:05:53] <wikibugs>	 Operations, Puppet, Wikimedia-Apache-configuration, Patch-For-Review: Refactor the mediawiki puppet classes to make HHVM default, drop zend compatibility - https://phabricator.wikimedia.org/T126310#2177350 (Joe) p:Triage>Normal a:Joe
[12:07:33] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, HHVM: Backport HHVM from sid => jessie and build all of our extensions for jessie as well - https://phabricator.wikimedia.org/T131755#2177353 (Joe)
[12:08:06] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, HHVM: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2177150 (Joe) p:Triage>Normal a:Joe
[12:10:47] <icinga-wm>	 PROBLEM - kartotherian on maps-test2004 is CRITICAL: Connection refused
[12:11:56] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, HHVM: Convert the hhvm puppet module to be compatible with Debian jessie - https://phabricator.wikimedia.org/T131756#2177370 (Joe)
[12:12:07] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, HHVM: Convert the hhvm puppet module to be compatible with Debian jessie - https://phabricator.wikimedia.org/T131756#2177370 (Joe) a:Joe>None
[12:14:07] <grrrit-wm>	 (PS1) ArielGlenn: set up for keyholder for dumps deployment from deployment servers [puppet] - https://gerrit.wikimedia.org/r/281425
[12:15:07] <icinga-wm>	 PROBLEM - tilerator on maps-test2004 is CRITICAL: Connection refused
[12:15:27] <icinga-wm>	 PROBLEM - tileratorui on maps-test2004 is CRITICAL: Connection refused
[12:16:07] <icinga-wm>	 RECOVERY - kartotherian on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 949 bytes in 0.083 second response time
[12:16:22] <_joe_>	 what happened to maps-test2004?
[12:16:57] <icinga-wm>	 RECOVERY - tilerator on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 304 bytes in 0.088 second response time
[12:17:17] <icinga-wm>	 RECOVERY - tileratorui on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 304 bytes in 0.089 second response time
[12:18:46] <icinga-wm>	 PROBLEM - RAID on stat1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded)
[12:30:44] <elukey>	 ouch --^
[12:32:41] <grrrit-wm>	 (03PS1) 10ArielGlenn: setup for dumps deployment public key on snapshot hosts for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/281428 
[12:34:27] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [5000000.0]
[12:43:31] <wikibugs>	 6Operations, 10ops-eqiad, 10Phabricator: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2176955 (10Cmjohnson) There have been several servers with cpu heat issues over the last few months. Re-applying thermal paste has been an effective fix.  Iridium i...
[12:43:36] <grrrit-wm>	 (03CR) 10Dereckson: Enable Translate extension on uawikimedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm)
[12:43:53] <grrrit-wm>	 (03CR) 10BBlack: [C: 031] resolving::domain_search: drop esams.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/280503 (https://phabricator.wikimedia.org/T123712) (owner: 10Dzahn)
[12:45:08] <grrrit-wm>	 (03CR) 10Dereckson: [C: 031] Two permission changes at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281314 (https://phabricator.wikimedia.org/T131684) (owner: 10Luke081515)
[12:47:55] <grrrit-wm>	 (03PS4) 10BBlack: tlsproxy: nginx security restrictions via systemd unit frag [puppet] - 10https://gerrit.wikimedia.org/r/279952 
[12:49:18] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] tlsproxy: nginx security restrictions via systemd unit frag [puppet] - 10https://gerrit.wikimedia.org/r/279952 (owner: 10BBlack)
[12:50:50] <grrrit-wm>	 (03CR) 10Dereckson: "@awight What's the exact goal of this change?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis)
[12:52:45] <grrrit-wm>	 (03CR) 10Dereckson: [C: 04-1] "Misleading commit message, lack of clear rationale." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis)
[12:53:06] <wikibugs>	 6Operations, 10Traffic, 10fundraising-tech-ops, 13Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#2177434 (10BBlack) Note this should get resolved via T130414 's https://gerrit.wikimedia.org/r/#/c/278353
[12:55:58] <icinga-wm>	 PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:55:58] <icinga-wm>	 PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:56:47] <icinga-wm>	 PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:56:47] <icinga-wm>	 PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:57:47] <icinga-wm>	 PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:57:55] <bblack>	 annoying :P
[13:00:08] <wikibugs>	 6Operations, 10Traffic, 13Patch-For-Review, 7Varnish: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2177437 (10ema) 5Open>3Resolved
[13:03:33] <grrrit-wm>	 (03PS2) 10ArielGlenn: set up for keyholder for dumps deployment from deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/281425 
[13:03:54] <grrrit-wm>	 (03PS1) 10BBlack: tlsproxy: add parent directory for sysd unit frags [puppet] - 10https://gerrit.wikimedia.org/r/281432 
[13:04:57] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] tlsproxy: add parent directory for sysd unit frags [puppet] - 10https://gerrit.wikimedia.org/r/281432 (owner: 10BBlack)
[13:05:34] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] set up for keyholder for dumps deployment from deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/281425 (owner: 10ArielGlenn)
[13:05:52] <grrrit-wm>	 (03PS2) 10BBlack: tlsproxy: add parent directory for sysd unit frags [puppet] - 10https://gerrit.wikimedia.org/r/281432 
[13:06:20] <grrrit-wm>	 (03PS2) 10ArielGlenn: setup for dumps deployment public key on snapshot hosts for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/281428 
[13:08:03] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] setup for dumps deployment public key on snapshot hosts for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/281428 (owner: 10ArielGlenn)
[13:14:05] <grrrit-wm>	 (03PS3) 10BBlack: tlsproxy: add parent directory for sysd unit frags [puppet] - 10https://gerrit.wikimedia.org/r/281432 
[13:14:12] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] tlsproxy: add parent directory for sysd unit frags [puppet] - 10https://gerrit.wikimedia.org/r/281432 (owner: 10BBlack)
[13:16:27] <icinga-wm>	 RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[13:18:07] <icinga-wm>	 PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:18:57] <icinga-wm>	 PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:19:17] <icinga-wm>	 PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:19:17] <icinga-wm>	 RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[13:19:57] <icinga-wm>	 RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:20:38] <icinga-wm>	 RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:21:06] <icinga-wm>	 RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:21:06] <icinga-wm>	 RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[13:21:57] <icinga-wm>	 PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:22:06] <icinga-wm>	 PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:22:48] <icinga-wm>	 PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:22:49] <icinga-wm>	 PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:22:50] <icinga-wm>	 PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:23:46] <icinga-wm>	 RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:23:47] <icinga-wm>	 RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:23:57] <bblack>	 puppet should be fine on caches, this is just the delayed fallout of the "bad patch" -> "disable puppet" -> "merge fix" -> "enable + run puppet" cycle, which ends up notifying of failures along the way
[13:24:27] <icinga-wm>	 PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:24:27] <icinga-wm>	 RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:24:36] <icinga-wm>	 RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:24:36] <icinga-wm>	 RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:24:36] <icinga-wm>	 PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:26:17] <icinga-wm>	 RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:26:26] <icinga-wm>	 RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:27:47] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[13:28:55] <grrrit-wm>	 (03CR) 10Luke081515: "reckeck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm)
[13:29:07] <grrrit-wm>	 (03CR) 10Luke081515: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm)
[13:30:22] <wikibugs>	 6Operations, 10ops-eqiad: stat1002 broken disk causing degraded RAID array - https://phabricator.wikimedia.org/T131758#2177452 (10elukey)
[13:32:01] <grrrit-wm>	 (03PS1) 10ArielGlenn: enable keyholder and scap cfg setup on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281435 
[13:39:04] <wikibugs>	 6Operations, 10Traffic, 7Varnish: Add icinga monitoring for varnish statistics daemons - https://phabricator.wikimedia.org/T131760#2177485 (10ema)
[13:40:24] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] enable keyholder and scap cfg setup on deployment servers for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281435 (owner: 10ArielGlenn)
[13:40:36] <ema>	 !log nginx rolling restart for openssl upgrade on cache hosts
[13:40:40] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:46:13] <wikibugs>	 6Operations, 10Traffic, 7Varnish: Add icinga monitoring for varnish statistics daemons - https://phabricator.wikimedia.org/T131760#2177507 (10ema) p:5Triage>3Normal
[13:47:18] <wikibugs>	 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Backport HHVM from sid => jessie and build all of our extensions for jessie as well - https://phabricator.wikimedia.org/T131755#2177508 (10Joe) p:5Triage>3Normal a:3Joe
[13:47:38] <grrrit-wm>	 (03PS1) 10ArielGlenn: fix typo in name of private key file for dumps dpeloyment [puppet] - 10https://gerrit.wikimedia.org/r/281438 
[13:48:12] <Nemo_bis>	  ciao ema
[13:48:20] <ema>	 hey Nemo_bis 
[13:48:30] <Nemo_bis>	 (in Italian) I hadn't memorized your nick yet
[13:49:01] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] fix typo in name of private key file for dumps dpeloyment [puppet] - 10https://gerrit.wikimedia.org/r/281438 (owner: 10ArielGlenn)
[13:49:06] <icinga-wm>	 PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures
[13:50:35] <grrrit-wm>	 (03PS1) 10BBlack: varnish+statsd: refactor classes, move rls to text-only [puppet] - 10https://gerrit.wikimedia.org/r/281439 (https://phabricator.wikimedia.org/T131353) 
[13:51:28] <wikibugs>	 6Operations, 7HHVM: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2177518 (10Joe)
[13:51:31] <wikibugs>	 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2177517 (10Joe)
[13:51:51] <elukey>	 PROBLEM - too many italian words registered in the channel :D
[13:52:03] <volans>	 indeed
[13:52:26] <grrrit-wm>	 (03CR) 10BBlack: "@ori: it seems like rls should be text-cluster-only (where load.php lives). I was noting it seems to log some basic headers for all reqs," [puppet] - 10https://gerrit.wikimedia.org/r/281439 (https://phabricator.wikimedia.org/T131353) (owner: 10BBlack)
[13:52:37] <icinga-wm>	 RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[13:53:48] <icinga-wm>	 PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures
[13:54:27] <wikibugs>	 6Operations, 6Labs, 13Patch-For-Review: Labtest designate giving out Forbidden exceptions when trying to list domains - https://phabricator.wikimedia.org/T130979#2177525 (10chasemp) p:5Triage>3Low
[13:57:28] <grrrit-wm>	 (03CR) 10Base: Enable Translate extension on uawikimedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm)
[13:58:40] <grrrit-wm>	 (03PS1) 10ArielGlenn: start new snapshot role with the basics at first: scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/281440 
[14:00:12] <grrrit-wm>	 (03CR) 10Ema: [C: 031] "Checked with pcc, works as advertised." [puppet] - 10https://gerrit.wikimedia.org/r/281439 (https://phabricator.wikimedia.org/T131353) (owner: 10BBlack)
[14:00:54] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] start new snapshot role with the basics at first: scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/281440 (owner: 10ArielGlenn)
[14:01:49] <wikibugs>	 6Operations, 6Labs, 10Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2177536 (10chasemp) p:5Triage>3Normal
[14:06:48] <wikibugs>	 6Operations, 6Labs: Can't create account "Bishoy Camel" (user with a former SVN account not migrated) - https://phabricator.wikimedia.org/T128833#2177556 (10chasemp) p:5Triage>3High
[14:07:30] <wikibugs>	 6Operations, 6Labs, 10Monitoring, 10Tool-Labs: Make icinga-wm report Tools homepage check at #wikimedia-labs, too - https://phabricator.wikimedia.org/T128716#2177558 (10chasemp) p:5Triage>3Low
[14:07:37] <wikibugs>	 6Operations, 6Labs, 10Monitoring, 10Tool-Labs: Add other Tools administrators to the Icinga notification group - https://phabricator.wikimedia.org/T128715#2177559 (10chasemp) p:5Triage>3Normal
[14:07:59] <wikibugs>	 6Operations, 6Labs, 10Tool-Labs: Get rid of Tool Labs home page check from shinken - https://phabricator.wikimedia.org/T128615#2177561 (10chasemp) p:5Triage>3Normal
[14:08:40] <grrrit-wm>	 (03PS1) 10ArielGlenn: enable new snapshot role on snapshot1005 [puppet] - 10https://gerrit.wikimedia.org/r/281442 
[14:10:35] <grrrit-wm>	 (03PS5) 10Urbanecm: Enable Translate extension on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) 
[14:13:05] <grrrit-wm>	 (03PS6) 10Urbanecm: Enable Translate extension on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) 
[14:14:07] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] enable new snapshot role on snapshot1005 [puppet] - 10https://gerrit.wikimedia.org/r/281442 (owner: 10ArielGlenn)
[14:17:31] <apergos>	 if icinga whines about puppet errors on snapshot1005, please ignore, I'm working on it
[14:20:36] <icinga-wm>	 RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[14:20:37] <icinga-wm>	 PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:23:58] <icinga-wm>	 PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[14:26:43] <grrrit-wm>	 (03PS2) 10Ldaptestaccount123: Tools: Add mytop [puppet] - 10https://gerrit.wikimedia.org/r/272435 (https://phabricator.wikimedia.org/T58999) (owner: 10BryanDavis)
[14:27:52] <grrrit-wm>	 (03CR) 10Dereckson: "Logic looks good to me." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm)
[14:27:53] <paravoid>	 sorry what?
[14:27:59] <paravoid>	 ldaptestaccount what now?
[14:28:36] <wikibugs>	 6Operations, 10ops-codfw: rack five new spare pool systems - https://phabricator.wikimedia.org/T130941#2177593 (10Papaul)
[14:30:59] <grrrit-wm>	 (03PS1) 10Urbanecm: Add new namespaces and new aliases for newikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) 
[14:31:19] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add new namespaces and new aliases for newikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm)
[14:35:13] <wikibugs>	 6Operations, 10Traffic, 7Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177610 (10BBlack)
[14:35:44] <grrrit-wm>	 (03PS2) 10Urbanecm: Add new namespaces and new aliases for newikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) 
[14:35:45] <wikibugs>	 6Operations, 10ops-codfw: rack five new spare pool systems - https://phabricator.wikimedia.org/T130941#2177627 (10Papaul) Network ports information  WMF6404     ge-5/0/23   row A rack A5 WMF6405     ge-5/0/14   row C rack C5 WMF6406     ge-5/0/15   row C rack C5 WMF6407     ge-5/0/09   row D rack D5 WMF6408...
[14:36:22] <wikibugs>	 6Operations, 10ops-codfw: rack five new spare pool systems - https://phabricator.wikimedia.org/T130941#2177632 (10Papaul)
[14:36:43] <grrrit-wm>	 (03PS1) 10ArielGlenn: scap3 for dumps needs the scap dir group writeable by wikidev it seems [puppet] - 10https://gerrit.wikimedia.org/r/281445 
[14:38:04] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] scap3 for dumps needs the scap dir group writeable by wikidev it seems [puppet] - 10https://gerrit.wikimedia.org/r/281445 (owner: 10ArielGlenn)
[14:39:37] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: installserver: port squid3 changes for trusty/jessie [puppet] - 10https://gerrit.wikimedia.org/r/281447 (https://phabricator.wikimedia.org/T123733) 
[14:46:11] <elukey>	 !log de-pooled aqs1001.eqiad from the confd pool for nodejs upgrade
[14:46:15] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:47:54] <grrrit-wm>	 (03PS4) 10Rush: Tools: Add dev packages needed to compile python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/272415 (https://phabricator.wikimedia.org/T114388) (owner: 10BryanDavis)
[14:48:11] <grrrit-wm>	 (03PS7) 10Urbanecm: Enable Translate extension on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) 
[14:48:57] <icinga-wm>	 RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys.
[14:49:53] <grrrit-wm>	 (03CR) 10Urbanecm: "@Dereckson: Fixed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: 10Urbanecm)
[14:50:54] <grrrit-wm>	 (03CR) 10Rush: [C: 032] Tools: Add dev packages needed to compile python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/272415 (https://phabricator.wikimedia.org/T114388) (owner: 10BryanDavis)
[14:51:01] <grrrit-wm>	 (03PS3) 10Rush: Tools: Add mytop [puppet] - 10https://gerrit.wikimedia.org/r/272435 (https://phabricator.wikimedia.org/T58999) (owner: 10BryanDavis)
[14:51:11] <grrrit-wm>	 (03CR) 10Rush: [C: 032 V: 032] Tools: Add mytop [puppet] - 10https://gerrit.wikimedia.org/r/272435 (https://phabricator.wikimedia.org/T58999) (owner: 10BryanDavis)
[14:52:09] <grrrit-wm>	 (03CR) 10Thcipriani: "One inline comment/question" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279415 (owner: 10Mobrovac)
[14:53:36] <icinga-wm>	 PROBLEM - AQS root url on aqs1001 is CRITICAL: Connection refused
[14:54:08] <icinga-wm>	 PROBLEM - cassandra CQL 10.64.0.123:9042 on aqs1001 is CRITICAL: Connection refused
[14:54:16] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.123, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[14:54:56] <icinga-wm>	 PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection refused
[14:55:08] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[14:55:15] <urandom>	 !log Restarting restbase2004-a.codfw.wmnet (cancelling bootstrap of 2004-b)
[14:55:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:55:38] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: /pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /unique-devices/{p
[14:55:39] <icinga-wm>	 PROBLEM - Analytics Cassandra database on aqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[14:55:39] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: /pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /unique-devices/{p
[14:56:07] <icinga-wm>	 PROBLEM - cassandra service on aqs1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[14:56:37] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0]
[14:57:08] <icinga-wm>	 RECOVERY - AQS root url on aqs1001 is OK: HTTP OK: HTTP/1.1 200 - 727 bytes in 0.033 second response time
[14:57:27] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[14:57:28] <icinga-wm>	 RECOVERY - Analytics Cassandra database on aqs1001 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[14:57:28] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy
[14:57:48] <icinga-wm>	 RECOVERY - cassandra CQL 10.64.0.123:9042 on aqs1001 is OK: TCP OK - 0.008 second response time on port 9042
[14:57:56] <icinga-wm>	 RECOVERY - cassandra service on aqs1001 is OK: OK - cassandra is active
[14:57:57] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy
[14:58:37] <icinga-wm>	 RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 0.007 second response time on port 9042
[15:00:05] <jouncebot>	 anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160404T1500).
[15:00:05] <jouncebot>	 bearND mdholloway: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:43] <wikibugs>	 6Operations, 10Traffic, 7Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177744 (10BBlack) On point 3 above (when does varnish send TE:chunked?), my best observations/code-searching indicate:  1. Obviously a do_stream of a chunked fetch is a chun...
[15:01:39] <bearND>	 hi
[15:01:48] <thcipriani>	 bearND: Hi, I can SWAT for you :)
[15:02:04] <bearND>	 thcipriani: thanks
[15:03:21] <grrrit-wm>	 (03PS1) 10Papaul: DNS: Adding mgmt DNS for spare pool servers Bug: T130941 [dns] - 10https://gerrit.wikimedia.org/r/281449 (https://phabricator.wikimedia.org/T130941) 
[15:04:23] <wikibugs>	 7Blocked-on-Operations, 6Operations, 10RESTBase, 10RESTBase-Cassandra, 13Patch-For-Review: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2177752 (10fgiunchedi) on sunday restbase2004 ran out of disk space while bootstrapping 2004-b  ``` 12:...
[15:06:03] <wikibugs>	 6Operations, 10ops-codfw: rack five new spare pool systems - https://phabricator.wikimedia.org/T130941#2177758 (10Papaul)
[15:08:10] <wikibugs>	 6Operations, 10ops-codfw: rack five new spare pool systems - https://phabricator.wikimedia.org/T130941#2177761 (10Papaul) a:5Papaul>3RobH @ Please update switch ports and resolve this task once complete. Thanks
[15:08:27] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[15:08:48] <wikibugs>	 6Operations, 10Traffic, 7Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177767 (10BBlack) And on point 2 above: since varnish seems to be smart about using TE:chunked only when the response length isn't easy to know, there's not much wiggle room...
[15:09:02] <wikibugs>	 6Operations, 10Traffic, 7Varnish: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2168861 (10ema) a:3ema
[15:11:17] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:13:27] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:15:21] <wikibugs>	 6Operations, 10ops-eqiad, 10netops: investigate why mr1-eqiad randomly rebooted - https://phabricator.wikimedia.org/T131379#2177774 (10faidon) 5Open>3declined I re-rebooted it from the console, as it wasn't able to read the SSH keys (!? the CF is maybe broken?) and hence sshd was unable to start. It works...
[15:18:40] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.27.0-wmf.19/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android production app: 50% [[gerrit:280957]] (duration: 00m 46s)
[15:18:41] <wikibugs>	 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2177789 (10chasemp) p:5Triage>3High
[15:18:43] <thcipriani>	 ^ bearND check please
[15:18:45] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:19:42] <bearND>	 thcipriani: looks good, thank you
[15:19:53] <thcipriani>	 bearND: cool, thanks for checking!
[15:21:49] <wikibugs>	 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2177807 (10chasemp) Some details from an email @tstarling sent out as a notice  ```...So I powered it up, and it came up, but /var/lo...
[15:23:24] <wikibugs>	 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team, 15User-greg: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2177828 (10chasemp) a:3greg tossing your way because of the nature of the issue and need for immediate feedback for...
[15:23:38] <grrrit-wm>	 (03PS12) 10KartikMistry: Enable non-default MT for some languages [puppet] - 10https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) 
[15:24:11] <wikibugs>	 6Operations, 10Traffic, 7Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177838 (10BBlack) Going into more detail on the current behaviors of misc and upload clusters:  **cache_misc** - regardless of tier/layer, it sets do_stream for objects >= 1...
[15:24:53] <wikibugs>	 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team, 15User-greg: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2177840 (10greg) yeah, stab in the dark guess is it might happen again tonight after the dumps (I assume?) run.  When...
[15:24:55] <grrrit-wm>	 (03PS1) 10ArielGlenn: add the dumps deploy pub key [puppet] - 10https://gerrit.wikimedia.org/r/281455 
[15:25:28] <grrrit-wm>	 (03PS2) 10ArielGlenn: add the dumps deploy pub key [puppet] - 10https://gerrit.wikimedia.org/r/281455 
[15:25:49] <wikibugs>	 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2177842 (10greg) a:5greg>3None
[15:38:23] <wikibugs>	 6Operations, 10Traffic, 7Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177962 (10BBlack) Also, confirmed that do_stream of a non-chunked fetch doesn't cause chunked response on cache_upload.
[15:39:11] <grrrit-wm>	 (03Abandoned) 10ArielGlenn: add the dumps deploy pub key [puppet] - 10https://gerrit.wikimedia.org/r/281455 (owner: 10ArielGlenn)
[15:39:55] <grrrit-wm>	 (03PS1) 10Ema: Misc cluster VCL: avoid name conflict between directors and probes [puppet] - 10https://gerrit.wikimedia.org/r/281457 (https://phabricator.wikimedia.org/T131501) 
[15:40:54] <wikibugs>	 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2177971 (10Cmjohnson) @greg Downtime Max 10 minutes but not even that long. Can do whenever you're ready
[15:41:53] <wikibugs>	 6Operations, 10Traffic, 7Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2177973 (10BBlack) Another input here: in the common case, it seems MediaWiki outputs content with TE:chunked, too.
[15:42:48] <Krenair>	 !log ran wikitech-static updates
[15:42:52] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:45:38] <elukey>	  !log aqs1001 re-added to the aqs pool (nodejs NOT upgraded due to issues with Cassandra)
[15:45:52] <greg-g>	 elukey: remove the leading space :)
[15:46:26] <elukey>	 greg-g: ah snap sorry! 
[15:49:05] <grrrit-wm>	 (03PS1) 10ArielGlenn: add dumps deployment key [puppet] - 10https://gerrit.wikimedia.org/r/281460 
[15:49:14] <icinga-wm>	 ACKNOWLEDGEMENT - RAID on stat1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) Elukey Opened a phab task for DC Ops: https://phabricator.wikimedia.org/T131758
[15:49:34] <elukey>	 --^ sorry I forgot to ack icinga
[15:51:01] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 031] "Looks awesome, cleans up a lot of cruft." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279198 (https://phabricator.wikimedia.org/T130419) (owner: 1020after4)
[15:51:15] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] add dumps deployment key [puppet] - 10https://gerrit.wikimedia.org/r/281460 (owner: 10ArielGlenn)
[15:53:11] <greg-g>	 !log 15:45 <    elukey>  !log aqs1001 re-added to the aqs pool (nodejs NOT upgraded due to issues with Cassandra)
[15:53:15] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:53:27] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[15:54:46] <wikibugs>	 6Operations, 10DBA, 6Labs, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#2178009 (10Dispenser)
[15:59:03] <elukey>	 greg-g: sorry I was doing 3 things at the same time and didn't get why you told me to re-log, I probably need coffee :)
[15:59:16] <wikibugs>	 6Operations: Integrate jessie 8.4 point release - https://phabricator.wikimedia.org/T131746#2178020 (10MoritzMuehlenhoff) The following updates from jessie 8.4 have been deployed: pcre3 gnupg apt nettle giflib subversion unbound stress
[15:59:16] <elukey>	 (or do less things, or both)
[15:59:32] <elukey>	 thanks anyway :)
[16:01:01] <greg-g>	 :)
[16:04:47] <wikibugs>	 6Operations, 6Labs, 13Patch-For-Review: Labtest designate giving out Forbidden exceptions when trying to list domains - https://phabricator.wikimedia.org/T130979#2178026 (10Andrew) 5Open>3Resolved a:3Andrew
[16:06:19] <grrrit-wm>	 (03CR) 10ArielGlenn: "note that soon I'll have to add dumps_blahblah to this patchset (or to the merged code if it's merged before I get there)." [puppet] - 10https://gerrit.wikimedia.org/r/279198 (https://phabricator.wikimedia.org/T130419) (owner: 1020after4)
[16:16:24] <wikibugs>	 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2178070 (10mmodell) Thanks @dzahn for setting this up so quickly. I tested that and I wa...
[16:16:55] <wikibugs>	 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2178072 (10mmodell)
[16:27:43] <wikibugs>	 6Operations, 10Traffic, 7Varnish: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2178082 (10ema)
[16:29:44] <Luke|away>	 argh
[16:29:56] <Luke081515>	 wait a moment
[16:30:47] <Luke081515>	 that was it. Phabricator has been up again for more than two hours, so I guess someone forgot to change the topic?
[16:31:56] <chasemp>	 Luke081515: I'm sure yes thanks
[16:32:22] <Luke081515>	 np ;)
[16:32:34] <mafk>	 jouncebot next
[16:32:35] <jouncebot>	 In 3 hour(s) and 27 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160404T2000)
[16:33:21] <Luke081515>	 mafk: Need to deploy an easy change? If so, I can do it for you at the evening SWAT
[16:33:31] <Luke081515>	 I have already two patches scheduled
[16:33:42] <mafk>	 Luke081515: well, I have an astwiki logo one
[16:33:48] <mafk>	 not sure if doing it today
[16:33:51] <Luke081515>	 which gerrit number?
[16:33:56] <mafk>	 lemme search
[16:34:34] <mafk>	 280445
[16:35:18] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [5000000.0]
[16:35:38] <Luke081515>	 mafk: Sounds good. If you want, I can do it at evening SWAT
[16:35:53] <mafk>	 Luke081515: if you want, that'd be good
[16:36:04] <Luke081515>	 ok, I will sign it up
[16:36:08] <urandom>	 !log Restarting bootstrap of restbase2004.codfw.wmnet : T95253
[16:36:09] <stashbot>	 T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253
[16:36:12] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:36:14] <Luke081515>	 I already have two, so it's not a big deal ;)
[16:36:20] <mafk>	 :D thank you
[16:37:39] <grrrit-wm>	 (03PS1) 10Rush: toollabs elastic don't use nginx light [puppet] - 10https://gerrit.wikimedia.org/r/281464 (https://phabricator.wikimedia.org/T131644) 
[16:40:01] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Move network::checks to netops::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/281465 
[16:40:03] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: netops: abstract monitoring checks into a define [puppet] - 10https://gerrit.wikimedia.org/r/281466 
[16:40:05] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Add check_jnx_alarms to check Juniper chassis alarms [puppet] - 10https://gerrit.wikimedia.org/r/281467 (https://phabricator.wikimedia.org/T83992) 
[16:41:40] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] netops: abstract monitoring checks into a define [puppet] - 10https://gerrit.wikimedia.org/r/281466 (owner: 10Faidon Liambotis)
[16:41:53] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add check_jnx_alarms to check Juniper chassis alarms [puppet] - 10https://gerrit.wikimedia.org/r/281467 (https://phabricator.wikimedia.org/T83992) (owner: 10Faidon Liambotis)
[16:43:00] <grrrit-wm>	 (03PS2) 10Faidon Liambotis: Add check_jnx_alarms to check Juniper chassis alarms [puppet] - 10https://gerrit.wikimedia.org/r/281467 (https://phabricator.wikimedia.org/T83992) 
[16:43:02] <grrrit-wm>	 (03PS2) 10Faidon Liambotis: netops: abstract monitoring checks into a define [puppet] - 10https://gerrit.wikimedia.org/r/281466 
[16:45:06] <mafk>	 Luke081515: I see that there's an imminent (?) migration from gerrit to differential for mediawiki-config. Is arcanist the program we should use to commit to differential/diffusion?
[16:45:54] <mafk>	 T131418
[16:45:54] <stashbot>	 T131418: Migrate mediawiki-config to Differential - https://phabricator.wikimedia.org/T131418
[16:46:21] <grrrit-wm>	 (03PS2) 10Rush: toollabs elastic don't use nginx light [puppet] - 10https://gerrit.wikimedia.org/r/281464 (https://phabricator.wikimedia.org/T131644) 
[16:47:38] <wikibugs>	 6Operations, 10Ops-Access-Requests: Grant reedy access to librenms - https://phabricator.wikimedia.org/T131252#2178157 (10RobH) a:3RobH Discussed in the ops meeting, and its granted.  However, we need to chat with Sam about 3 phase power and how it operates so no misleading figures are disclosed.  So I'll se...
[16:54:28] <paladox>	 mafk: Yes
[16:56:15] <mafk>	 paladox: thank you
[16:56:23] <paladox>	 you're welcome
[16:58:33] <grrrit-wm>	 (03CR) 10Rush: [C: 032] toollabs elastic don't use nginx light [puppet] - 10https://gerrit.wikimedia.org/r/281464 (https://phabricator.wikimedia.org/T131644) (owner: 10Rush)
[16:59:23] <halfak>	 hey _joe_.  Wanted to talk to you briefly about https://phabricator.wikimedia.org/T118495
[16:59:38] <halfak>	 ^ uwsgi restarts taking exactly 1 minute and 30 seconds.
[16:59:48] <halfak>	 We learned some things and have a proposed short-term fix
[17:02:49] <halfak>	 I'm jet lagging and going to turn into a pile of mush, so I'm going to head offline for now.  
[17:02:57] <halfak>	 Should be more normalish tomorrow. 
[17:02:58] <halfak>	 o/
[17:06:15] <wikibugs>	 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2178278 (10greg) >>! In T131742#2177971, @Cmjohnson wrote: > @greg Downtime Max 10 minutes but not even that long. Can do whenever yo...
[17:06:27] <wikibugs>	 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2178279 (10greg) a:3Cmjohnson
[17:12:57] <wikibugs>	 6Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2178301 (10mmodell)
[17:14:42] <wikibugs>	 6Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2178316 (10jeremyb-phone)
[17:15:07] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[17:15:53] <wikibugs>	 6Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2178325 (10greg) See also: {T131742} :)
[17:17:25] <grrrit-wm>	 (03PS2) 10Faidon Liambotis: Move network::checks to netops::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/281465 
[17:17:27] <grrrit-wm>	 (03PS3) 10Faidon Liambotis: netops: abstract monitoring checks into a define [puppet] - 10https://gerrit.wikimedia.org/r/281466 
[17:17:29] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: netops: add IPv6 host checks [puppet] - 10https://gerrit.wikimedia.org/r/281473 
[17:17:31] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: netops: also monitor pfw-eqiad/pfw-codfw [puppet] - 10https://gerrit.wikimedia.org/r/281474 
[17:18:01] <paravoid>	 apergos: snapshot1005 puppet failure
[17:18:12] <apergos>	 yep working on it
[17:18:14] <paravoid>	 Jeff_Green: alnilam has a disk alert
[17:18:26] <Jeff_Green>	 paravoid: looking
[17:18:34] <paravoid>	 Jeff_Green: silicon too
[17:18:46] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[17:18:56] <paravoid>	 Jeff_Green: and silicon also has a check_amq_store alert for a while now
[17:19:15] <paravoid>	 andrewbogott/chasemp: labvirt1002 disk space alert too
[17:19:31] <andrewbogott>	 paravoid: thanks, will look post-meeting
[17:19:38] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] netops: also monitor pfw-eqiad/pfw-codfw [puppet] - 10https://gerrit.wikimedia.org/r/281474 (owner: 10Faidon Liambotis)
[17:20:14] <grrrit-wm>	 (03PS2) 10Faidon Liambotis: netops: also monitor pfw-eqiad/pfw-codfw [puppet] - 10https://gerrit.wikimedia.org/r/281474 
[17:21:10] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] Move network::checks to netops::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/281465 (owner: 10Faidon Liambotis)
[17:21:58] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[17:23:50] <Jeff_Green>	 paravoid: ok, it's all the same thing ultimately--they're both basically early warnings re. activemq store bloat
[17:24:07] <Jeff_Green>	 I'll adjust the thresholds and harangue fr-tech
[17:25:29] <paravoid>	 what's with stashbot?
[17:26:12] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] netops: abstract monitoring checks into a define [puppet] - 10https://gerrit.wikimedia.org/r/281466 (owner: 10Faidon Liambotis)
[17:26:33] <cmjohnson1>	 Will anyone have an issue if Phabricator is down for up to 10 mins?
[17:27:18] <mafk>	 fine for me cmjohnson1 
[17:29:04] <grrrit-wm>	 (03PS1) 10ArielGlenn: fix up name of dumps repo for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/281476 
[17:29:19] <greg-g>	 cmjohnson1: doit, not real "great" time today, so jfdi :)
[17:29:24] <greg-g>	 s/not/no/
[17:29:46] <cmjohnson1>	 k....i posted in -devtool channel 
[17:29:58] <cmjohnson1>	 will take it down in 10 mins
[17:29:59] <greg-g>	 thanks
[17:30:26] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] fix up name of dumps repo for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/281476 (owner: 10ArielGlenn)
[17:30:33] <paravoid>	 argh
[17:30:42] <grrrit-wm>	 (03PS2) 10Faidon Liambotis: netops: add IPv6 host checks [puppet] - 10https://gerrit.wikimedia.org/r/281473 
[17:30:47] <greg-g>	 !log Phabricator going down in about 10 minutes to hopefully address the overheating issue: T131742
[17:30:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:30:52] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] netops: add IPv6 host checks [puppet] - 10https://gerrit.wikimedia.org/r/281473 (owner: 10Faidon Liambotis)
[17:31:52] <grrrit-wm>	 (03PS3) 10Faidon Liambotis: netops: also monitor pfw-eqiad/pfw-codfw [puppet] - 10https://gerrit.wikimedia.org/r/281474 
[17:32:00] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032 V: 032] netops: also monitor pfw-eqiad/pfw-codfw [puppet] - 10https://gerrit.wikimedia.org/r/281474 (owner: 10Faidon Liambotis)
[17:32:21] <greg-g>	 pre-emptive /topic change :)
[17:34:18] <grrrit-wm>	 (03PS3) 10Faidon Liambotis: Add check_jnx_alarms to check Juniper chassis alarms [puppet] - 10https://gerrit.wikimedia.org/r/281467 (https://phabricator.wikimedia.org/T83992) 
[17:37:46] <cmjohnson1>	 !log shutting down iridium to reapply thermal paste
[17:37:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:39:48] <icinga-wm>	 PROBLEM - Router interfaces on mr1-codfw.oob is CRITICAL: CRITICAL: No response from remote host 216.117.46.36 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[17:40:19] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqiad.oob is CRITICAL: CRITICAL: No response from remote host 198.32.107.153 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[17:40:48] <icinga-wm>	 PROBLEM - Router interfaces on mr1-esams.oob is CRITICAL: CRITICAL: No response from remote host 164.138.24.90 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[17:40:56] <paravoid>	 ignore those, that would be me
[17:41:29] <icinga-wm>	 PROBLEM - Router interfaces on mr1-ulsfo.oob is CRITICAL: CRITICAL: No response from remote host 209.237.234.242 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[17:41:48] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[17:42:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[17:42:29] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[17:42:37] <_joe_>	 oh, heh
[17:42:42] <_joe_>	 that would be chris
[17:43:10] <greg-g>	 yep
[17:43:11] <Josve05a>	 ..."Phabricator will be down for about 10 minutes (T131742)" referencing a Phab-ticket....for when Phab is down...smart...
[17:43:12] <Josve05a>	 :p
[17:43:20] <greg-g>	 Josve05a: hey! what do you want?! :P
[17:43:32] <Josve05a>	 see the content of that ticket xD
[17:44:03] <greg-g>	 Josve05a: for when phab is back, see also: https://phabricator.wikimedia.org/T131775 (us asking for a backup web server for phab to make these things suck less)
[17:44:19] <Bsadowski1>	 pfft
[17:44:20] <andrewbogott>	 paravoid: I can't find the labvirt1002 disk space warning, did it clear on its own or was that a typo?
[17:44:33] <Josve05a>	 how much does such a thing cos (actual cost)
[17:44:34] <SPF|Cloud>	 It's just that iridium went down and someone is now (re-?)applying thermal paste on the CPU to prevent it from crashing again :)
[17:44:35] <Josve05a>	 cost*
[17:44:48] <paravoid>	 andrewbogott: I guess it cleared on its own
[17:44:51] <SPF|Cloud>	 it would be too long to put that in the topic
[17:45:40] <Amir1>	 I want to read the phab ticket but it is down
[17:45:43] <Amir1>	 :D
[17:46:12] <andrewbogott>	 paravoid: do you remember what the numbers were?  Was it a warning or a crit?
[17:46:26] <paravoid>	 no and it was a warning
[17:46:36] <andrewbogott>	 ok, good enough for me — thanks!
[17:46:37] <Amir1>	 you should send an announcement somewhere that people can read once the main thing is down, like wikitech-l
[17:47:38] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors
[17:48:58] <matt_flaschen>	 +1 Amir1 .  Link somewhere that isn't down (such as a wikitech-l archive post) to explain.
[17:49:30] <Amir1>	 yeah :)
[17:50:31] <_joe_>	 who broke icinga?
[17:50:41] <_joe_>	 the config, I mean
[17:51:07] <SPF|Cloud>	 I thought paravoid was making various config changes
[17:51:09] <volans>	 pfw-eqiad
[17:51:09] <cmjohnson1>	 iridium is back up
[17:51:17] <volans>	 Could not find any host matching 'pfw-eqiad' (config file '/etc/icinga/puppet_services.cfg', starting on line 322901)
[17:51:34] <paravoid>	 just puppet convergence errors I think
[17:51:37] <paravoid>	 I'm rerunning puppet
[17:51:48] <_joe_>	 paravoid: heh, that can happen, yes
[17:52:07] <cmjohnson1>	 phab is up
[17:52:18] <_joe_>	 ok, ttyl :)
[17:52:59] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy
[17:53:07] <wikibugs>	 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2178470 (10Cmjohnson) a:5Cmjohnson>3None Clean off the old thermal paste and reapplied.  Let's monitor for the next few days. Lea...
[17:55:29] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[17:57:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[18:04:57] <cmjohnson1>	 !log db1052 swapping failed disk slot 8
[18:05:02] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:06:28] <icinga-wm>	 PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail
[18:09:08] <grrrit-wm>	 (03CR) 10Mobrovac: Scap3: chown the target root dir if owned by root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279415 (owner: 10Mobrovac)
[18:11:45] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: netops::check: make IPv4 mandatory again [puppet] - 10https://gerrit.wikimedia.org/r/281480 
[18:14:04] <wikibugs>	 6Operations, 10ops-eqiad: db1052 degraded RAID - https://phabricator.wikimedia.org/T131701#2178548 (10Volans) I've sync with @Cmjohnson and he swapped the disk, the RAID is now rebuilding:  ``` $ sudo megacli -PDRbld -ShowProg -PhysDrv [32:8] -aALL  Rebuild Progress on Device at Enclosure 32, Slot 8 Completed...
[18:15:06] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] netops::check: make IPv4 mandatory again [puppet] - 10https://gerrit.wikimedia.org/r/281480 (owner: 10Faidon Liambotis)
[18:18:20] <cmjohnson1>	 !log stat1002 swapping failed disk slot 11
[18:18:25] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:20:59] <icinga-wm>	 RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:24:20] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [5000000.0]
[18:25:01] <grrrit-wm>	 (03PS1) 10Rush: Revert "toollabs elastic don't use nginx light" [puppet] - 10https://gerrit.wikimedia.org/r/281482 
[18:27:03] <grrrit-wm>	 (03PS2) 10Rush: Revert "toollabs elastic don't use nginx light" [puppet] - 10https://gerrit.wikimedia.org/r/281482 
[18:30:40] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct
[18:33:15] <grrrit-wm>	 (03CR) 10Rush: [C: 032] Revert "toollabs elastic don't use nginx light" [puppet] - 10https://gerrit.wikimedia.org/r/281482 (owner: 10Rush)
[18:35:11] <icinga-wm>	 PROBLEM - Router interfaces on pfw-codfw is CRITICAL: CRITICAL: host 208.80.153.195, interfaces up: 90, down: 12, dormant: 0, excluded: 0, unused: 0; ge-11/0/14: down; ge-11/0/15: down; ge-2/0/8: down; reth0: down; ge-9/0/3: down; ge-11/0/6: down; ge-0/0/3: down; ge-2/0/9: down; ge-2/0/14: down; ge-11/0/7: down; ge-0/0/2: down; ge-9/0/2: down
[18:35:30] <icinga-wm>	 PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 108, down: 2, dormant: 0, excluded: 0, unused: 0; vlan.1131: down - Subnet frack-external1-c-eqiad; reth0: down
[18:36:20] <tgr>	 !log ran MassMessages/sendMessages.php for T128056
[18:36:25] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:36:29] <grrrit-wm>	 (03CR) 10Bmansurov: "The new language overlay is in stable now: https://en.m.wikipedia.org/wiki/Book?mobileaction=stable#/languages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277837 (https://phabricator.wikimedia.org/T129274) (owner: 10Bmansurov)
[18:44:27] <wikibugs>	 6Operations, 10ops-eqiad, 10Phabricator, 6Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2176955 (10hashar) @cmjohnson do we have a system to monitor temperature? lm_sensors comes to mind, also found out Diamond has a coll...
[18:46:31] <grrrit-wm>	 (03CR) 10BBlack: [C: 031] "+1 because this is a decent workaround for now." [puppet] - 10https://gerrit.wikimedia.org/r/281457 (https://phabricator.wikimedia.org/T131501) (owner: 10Ema)
[18:47:38] <wikibugs>	 6Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2178647 (10mmodell) >>! In T131775#2178325, @greg wrote: > See also: {T131742} :)  Plus, every deployment involves significant downtime because phabricator services must al...
[18:47:50] <grrrit-wm>	 (03PS1) 10ArielGlenn: add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 
[18:49:21] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 (owner: 10ArielGlenn)
[18:49:56] <grrrit-wm>	 (03PS2) 10ArielGlenn: add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 
[18:50:01] <icinga-wm>	 RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[18:50:54] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 (owner: 10ArielGlenn)
[18:51:01] <grrrit-wm>	 (03PS3) 10ArielGlenn: add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 
[18:52:27] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 (owner: 10ArielGlenn)
[18:53:32] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: netops: disable SNMP checks for OOB interfaces [puppet] - 10https://gerrit.wikimedia.org/r/281490 
[18:54:04] <grrrit-wm>	 (03PS2) 10Faidon Liambotis: netops: disable SNMP checks for OOB interfaces [puppet] - 10https://gerrit.wikimedia.org/r/281490 
[18:55:48] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] netops: disable SNMP checks for OOB interfaces [puppet] - 10https://gerrit.wikimedia.org/r/281490 (owner: 10Faidon Liambotis)
[18:57:34] <grrrit-wm>	 (03PS4) 10Faidon Liambotis: Add check_jnx_alarms to check Juniper chassis alarms [puppet] - 10https://gerrit.wikimedia.org/r/281467 (https://phabricator.wikimedia.org/T83992) 
[19:04:01] <icinga-wm>	 RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 108, down: 0, dormant: 0, excluded: 1, unused: 0
[19:05:21] <grrrit-wm>	 (03PS4) 10ArielGlenn: add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 
[19:06:38] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] Add check_jnx_alarms to check Juniper chassis alarms [puppet] - 10https://gerrit.wikimedia.org/r/281467 (https://phabricator.wikimedia.org/T83992) (owner: 10Faidon Liambotis)
[19:10:03] <icinga-wm>	 PROBLEM - puppet last run on restbase-test2003 is CRITICAL: CRITICAL: puppet fail
[19:11:53] <wikibugs>	 6Operations, 10Ops-Access-Requests: global root access for gilles - https://phabricator.wikimedia.org/T130910#2178741 (10ori) 5declined>3Open First of all, the request was not for global root. The task description makes it clear that the access request is for Swift machines. My follow-up comment was a prop...
[19:13:12] <grrrit-wm>	 (03PS5) 10ArielGlenn: add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 
[19:15:01] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] add rest of new snapshots to scap3 target list [puppet] - 10https://gerrit.wikimedia.org/r/281486 (owner: 10ArielGlenn)
[19:16:15] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[19:18:16] <icinga-wm>	 RECOVERY - Router interfaces on pfw-codfw is OK: OK: host 208.80.153.195, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0
[19:19:10] <grrrit-wm>	 (03PS1) 10ArielGlenn: set up scap3 deploy for snapshot1006 snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/281494 
[19:19:26] <icinga-wm>	 PROBLEM - Juniper alarms on cr2-esams is CRITICAL: JNX_ALARMS CRITICAL - 4 red alarms, 0 yellow alarms
[19:19:28] <paravoid>	 chasemp: can I get the config lock on cr1-eqiad?
[19:20:02] <chasemp>	 yep go for it I'm puzzling over a failure to reorder here no worries paravoid
[19:20:30] <paravoid>	 chasemp: can you rollback; quit?
[19:21:34] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] set up scap3 deploy for snapshot1006 snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/281494 (owner: 10ArielGlenn)
[19:22:17] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper alarms on cr2-esams is CRITICAL: JNX_ALARMS CRITICAL - 4 red alarms, 0 yellow alarms Faidon Liambotis Still under provisioning
[19:22:23] <chasemp>	 paravoid: gtg I think?
[19:22:31] <paravoid>	 andrewbogott: labvirt1002 again -- DISK WARNING - free space: /var/lib/nova/instances 158834 MB (6% inode=99%):
[19:22:42] <andrewbogott>	 paravoid: ok, thanks
[19:25:53] <wikibugs>	 6Operations, 10Ops-Access-Requests: global root access for gilles - https://phabricator.wikimedia.org/T130910#2150515 (10BBlack) >>! In T130910#2178741, @ori wrote: > First of all, the request was not for global root. The task description makes it clear that the access request is for Swift machines. My follow-...
[19:26:46] <grrrit-wm>	 (03CR) 10Thcipriani: Scap3: chown the target root dir if owned by root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279415 (owner: 10Mobrovac)
[19:28:17] <wikibugs>	 6Operations, 10Monitoring, 10netops, 13Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992#2178759 (10faidon)
[19:28:36] <icinga-wm>	 RECOVERY - puppet last run on restbase-test2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:31:36] <wikibugs>	 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2178776 (10chasemp) Afaik the 'talk to phabricator' portion here is relevant for git-ssh...
[19:38:04] <akosiaris>	 !log depool maps-test2004 
[19:38:08] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:38:46] <Luke081515>	 greg-g: I guess you forgot to update the topic?^^
[19:40:01] <apergos>	 !log disabled puppet on snapshot1001,2,4  while new hosts come on line, til probably Apr 5-6
[19:40:06] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:40:15] <greg-g>	 Luke081515: yeah, oops
[19:40:24] <paravoid>	 apergos: ugh why
[19:40:39] <greg-g>	 (and I went away to lunch ;) )
[19:41:04] <grrrit-wm>	 (03PS2) 10Krinkle: webperf: Rename navtiming 'loading' and 'sending' to standard equivalent [puppet] - 10https://gerrit.wikimedia.org/r/281082 
[19:41:06] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: netops: monitor all asw/msw/psw as well [puppet] - 10https://gerrit.wikimedia.org/r/281495 (https://phabricator.wikimedia.org/T83992) 
[19:41:19] <apergos>	 paravoid: trying to refactor without worrying about the old hosts, which will be removed from puppet right after the new ones are running
[19:41:27] <apergos>	 (decommissioned)
[19:41:37] <apergos>	 it's only til tomorrow
[19:41:37] <paravoid>	 well then remove the includes from puppet or something
[19:42:54] <apergos>	 I don't actually want it (yet) to remove anything just in case I get stuck; this leaves me room to back out (and yes, sometimes commenting out or removing a stanza means things get removed, annoyingly enough)
[19:42:55] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] netops: monitor all asw/msw/psw as well [puppet] - 10https://gerrit.wikimedia.org/r/281495 (https://phabricator.wikimedia.org/T83992) (owner: 10Faidon Liambotis)
[19:43:25] <paravoid>	 then leave it as-is and don't merge your refactors
[19:43:35] <paravoid>	 this is really not the right way to refactor
[19:44:54] <apergos>	 again, it's only for a day. not even that 
[19:45:40] <apergos>	 I just know I cannot get it done tonight, without it taking a lot longer than if I wait til tomorrow am
[19:50:32] <grrrit-wm>	 (03PS3) 10Krinkle: webperf: Rename navtiming 'loading' and 'sending' to standard equivalent [puppet] - 10https://gerrit.wikimedia.org/r/281082 
[19:50:34] <grrrit-wm>	 (03PS1) 10Krinkle: webperf: Convert navtiming metric mapping into list [puppet] - 10https://gerrit.wikimedia.org/r/281497 
[19:50:36] <grrrit-wm>	 (03PS1) 10Krinkle: webperf: Collect metrics for 'domInteractive' and 'domComplete' [puppet] - 10https://gerrit.wikimedia.org/r/281498 
[19:50:39] <grrrit-wm>	 (03PS1) 10BBlack: VCL: remove all non-default between_bytes_timeout [puppet] - 10https://gerrit.wikimedia.org/r/281499 (https://phabricator.wikimedia.org/T131761) 
[19:51:49] <grrrit-wm>	 (03PS2) 10BBlack: VCL: remove all non-default between_bytes_timeout [puppet] - 10https://gerrit.wikimedia.org/r/281499 (https://phabricator.wikimedia.org/T131761) 
[19:55:00] <grrrit-wm>	 (03PS1) 10Krinkle: coal-web: Show domInteractive instead of domComplete [puppet] - 10https://gerrit.wikimedia.org/r/281501 
[20:00:05] <jouncebot>	 gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160404T2000).
[20:00:24] <bearND>	 no mobileapps deployment today
[20:00:26] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors
[20:04:25] <icinga-wm>	 PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: Puppet has 3 failures
[20:04:49] <subbu>	 !log starting parsoid deploy
[20:04:53] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:07:31] <subbu>	 !log synced code; restarted parsoid on wtp1002 as a canary
[20:07:35] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:07:57] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on maps-test2004 is CRITICAL: CRITICAL: Puppet has 3 failures alexandros kosiaris masked nodejs services
[20:12:23] <subbu>	 !log finished deploying parsoid sha 579ec3e6
[20:12:27] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:12:47] <andrewbogott>	 gwicke, subbu, is labs instance 'appservice' still in use?  And, if so, can it survive a few minutes of downtime?
[20:13:14] * subbu doesn't know what it is used for
[20:13:35] <gwicke>	 andrewbogott: it is in use, yes
[20:13:55] <andrewbogott>	 gwicke: and, downtime ok?
[20:13:58] <gwicke>	 it is exercised as part of RB integration tests on each commit
[20:14:22] <gwicke>	 a few minutes of downtime should be okay
[20:14:32] <gwicke>	 we can always re-run failed tests
[20:15:44] <andrewbogott>	 gwicke: ok, thank you, I'm going to migrate it elsewhere
[20:16:05] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: monitoring: add "switches" hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/281529 
[20:16:11] <gwicke>	 andrewbogott: thank you!
[20:16:34] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032 V: 032] monitoring: add "switches" hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/281529 (owner: 10Faidon Liambotis)
[20:18:36] <icinga-wm>	 PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail
[20:21:21] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct
[20:21:43] <icinga-wm>	 RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[20:24:52] <icinga-wm>	 PROBLEM - Juniper alarms on asw-c-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 1 yellow alarms
[20:25:06] <wikibugs>	 Operations, Monitoring, netops, Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992#2178874 (faidon)
[20:38:08] <grrrit-wm>	 (PS1) ArielGlenn: [WIP] basic snapshot role for dumps [puppet] - https://gerrit.wikimedia.org/r/281530
[20:38:38] <grrrit-wm>	 (PS6) Gehel: Make three of the newer ES nodes master eligable [puppet] - https://gerrit.wikimedia.org/r/251024 (https://phabricator.wikimedia.org/T112556) (owner: EBernhardson)
[20:39:14] <grrrit-wm>	 (CR) Gehel: "Added "20" to the commit message as it was also added as a potential master." [puppet] - https://gerrit.wikimedia.org/r/251024 (https://phabricator.wikimedia.org/T112556) (owner: EBernhardson)
[20:39:30] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] [WIP] basic snapshot role for dumps [puppet] - https://gerrit.wikimedia.org/r/281530 (owner: ArielGlenn)
[20:44:31] <grrrit-wm>	 (PS2) ArielGlenn: [WIP] basic snapshot role for dumps [puppet] - https://gerrit.wikimedia.org/r/281530
[20:52:30] <wikibugs>	 Operations, Traffic, Wikimedia-Shop, HTTPS: shop switches HTTPS -> HTTP when showing login prompt (on clicking checkout) - https://phabricator.wikimedia.org/T63528#2179008 (GHoltman) Open>Resolved a:GHoltman  Resolved per HuiZSF
[20:59:45] <gehel>	 do we have canary servers for mediawiki ? I know about the test servers (mw1017, etc), but do we have canaries with real user traffic ?
[21:00:30] <grrrit-wm>	 (PS3) ArielGlenn: basic snapshot role for dumps [puppet] - https://gerrit.wikimedia.org/r/281530
[21:01:06] <grrrit-wm>	 (PS1) Andrew Bogott: Increase Horizon timeout to 24 hours. [puppet] - https://gerrit.wikimedia.org/r/281531 (https://phabricator.wikimedia.org/T130621)
[21:01:08] <volans>	 gehel: I can see some mw with value canary and single_canary if you grep on puppet
[21:01:49] <grrrit-wm>	 (PS2) Andrew Bogott: Increase Horizon timeout to 24 hours. [puppet] - https://gerrit.wikimedia.org/r/281531 (https://phabricator.wikimedia.org/T130621)
[21:02:01] <volans>	 but I don't know how they are treated
[21:02:12] <grrrit-wm>	 (PS3) Andrew Bogott: Increase Horizon timeout to 24 hours. [puppet] - https://gerrit.wikimedia.org/r/281531 (https://phabricator.wikimedia.org/T130621)
[21:02:19] <grrrit-wm>	 (PS4) ArielGlenn: basic snapshot role for dumps [puppet] - https://gerrit.wikimedia.org/r/281530
[21:03:56] <grrrit-wm>	 (CR) Andrew Bogott: [C: 2] Increase Horizon timeout to 24 hours. [puppet] - https://gerrit.wikimedia.org/r/281531 (https://phabricator.wikimedia.org/T130621) (owner: Andrew Bogott)
[21:08:54] <gehel>	 volans: If I read this correctly, the "canary" mw servers in puppet code is just a way to mark them for salt. Which probably indicates that I could use them as early deploys for CirrusSearch over HTTPS...
[21:10:06] <volans>	 looks like this, but I didn't check if then that tag is used somewhere else
[21:10:23] <grrrit-wm>	 (PS5) ArielGlenn: basic snapshot role for dumps [puppet] - https://gerrit.wikimedia.org/r/281530
[21:10:50] <gehel>	 time to go to sleep. I'll check more in depth tomorrow...
[21:11:35] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] basic snapshot role for dumps [puppet] - https://gerrit.wikimedia.org/r/281530 (owner: ArielGlenn)
[21:16:42] <icinga-wm>	 PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: Puppet has 1 failures
[21:28:08] <grrrit-wm>	 (PS1) ArielGlenn: dumps: include the decl of repodir in the class using it [puppet] - https://gerrit.wikimedia.org/r/281533
[21:29:49] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] dumps: include the decl of repodir in the class using it [puppet] - https://gerrit.wikimedia.org/r/281533 (owner: ArielGlenn)
[21:31:42] <icinga-wm>	 PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures
[21:35:03] <icinga-wm>	 PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet has 1 failures
[21:42:32] <grrrit-wm>	 (PS1) ArielGlenn: explicitly define repodir on snpashots for cron run [puppet] - https://gerrit.wikimedia.org/r/281536
[21:44:11] <icinga-wm>	 RECOVERY - puppet last run on mw2116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:45:48] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] explicitly define repodir on snpashots for cron run [puppet] - https://gerrit.wikimedia.org/r/281536 (owner: ArielGlenn)
[21:47:51] <icinga-wm>	 RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[21:50:52] <icinga-wm>	 PROBLEM - puppet last run on mw2049 is CRITICAL: CRITICAL: puppet fail
[21:59:01] <icinga-wm>	 RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:04:23] <wikibugs>	 Operations, ops-eqiad: db1052 degraded RAID - https://phabricator.wikimedia.org/T131701#2179269 (Volans) ``` $ sudo megacli -PDRbld -ShowProg -PhysDrv [32:8] -aALL  Rebuild Progress on Device at Enclosure 32, Slot 8 Completed 47% in 237 Minutes. ``` All looks good on our monitoring metrics and on the host.
[22:04:53] <grrrit-wm>	 (PS1) ArielGlenn: commit the change of the var lookup for the repodir [puppet] - https://gerrit.wikimedia.org/r/281541
[22:06:28] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] commit the change of the var lookup for the repodir [puppet] - https://gerrit.wikimedia.org/r/281541 (owner: ArielGlenn)
[22:18:22] <icinga-wm>	 RECOVERY - puppet last run on mw2049 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[22:23:17] <grrrit-wm>	 (PS1) ArielGlenn: minor restructure of hiera data for snapshots [puppet] - https://gerrit.wikimedia.org/r/281544
[22:24:39] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] minor restructure of hiera data for snapshots [puppet] - https://gerrit.wikimedia.org/r/281544 (owner: ArielGlenn)
[22:36:52] <grrrit-wm>	 (PS1) ArielGlenn: enabling hhvm for dumps [puppet] - https://gerrit.wikimedia.org/r/281547 (https://phabricator.wikimedia.org/T94277)
[22:38:08] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] enabling hhvm for dumps [puppet] - https://gerrit.wikimedia.org/r/281547 (https://phabricator.wikimedia.org/T94277) (owner: ArielGlenn)
[22:41:56] <grrrit-wm>	 (CR) Dereckson: [C: 1] Enable Translate extension on uawikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/281403 (https://phabricator.wikimedia.org/T131731) (owner: Urbanecm)
[22:45:39] <Luke081515>	 jouncebot next
[22:45:39] <jouncebot>	 In 0 hour(s) and 14 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160404T2300)
[22:47:23] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper alarms on asw-c-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms Faidon Liambotis 2½ year old alarm. Will be fixed on the next JunOS upgrade.
[22:48:17] <grrrit-wm>	 (PS2) Faidon Liambotis: hhvm: use random tmpdir in hhvm-collect-heaps [puppet] - https://gerrit.wikimedia.org/r/263829
[22:48:41] <grrrit-wm>	 (CR) Faidon Liambotis: [C: 2 V: 2] hhvm: use random tmpdir in hhvm-collect-heaps [puppet] - https://gerrit.wikimedia.org/r/263829 (owner: Faidon Liambotis)
[22:48:56] <paravoid>	 apergos: you have unmerged changes
[22:49:00] <paravoid>	 ok to merge?
[22:49:06] <apergos>	 crap
[22:49:10] <apergos>	 yes please
[22:49:18] <apergos>	 thanks
[22:49:20] <apergos>	 paravoid: 
[22:49:45] <paravoid>	 (done)
[22:50:46] <grrrit-wm>	 (PS1) ArielGlenn: add mediawiki packages and other dependencies for snapshots [puppet] - https://gerrit.wikimedia.org/r/281552
[22:50:57] <apergos>	 thanks!
[22:52:06] <grrrit-wm>	 (CR) Faidon Liambotis: [C: 2] Remove scs-oe11-esams DNS [dns] - https://gerrit.wikimedia.org/r/281116 (owner: Faidon Liambotis)
[22:52:58] <Luke081515>	 paravoid: Is stashbot still banned?
[22:53:02] <icinga-wm>	 PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: Puppet has 1 failures
[22:54:51] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[22:55:04] <Dereckson>	 Hello.
[22:55:23] <greg-g>	 Luke081515: I don't think it is banned, it just isn't working
[22:55:31] <paravoid>	 it was banned by me
[22:55:39] <greg-g>	 I turned it off
[22:56:18] <greg-g>	 in case you're curious: https://github.com/bd808/tools-stashbot/issues/9
[22:57:50] <Dereckson>	 Krenair or MaxSem > can we work together on the SWAT?
[22:57:53] <grrrit-wm>	 (PS2) ArielGlenn: add mediawiki packages and other dependencies for snapshots [puppet] - https://gerrit.wikimedia.org/r/281552
[22:58:07] <MaxSem>	 Dereckson, sure
[22:58:32] <MaxSem>	 how do you want to communicate?
[22:58:52] <Krenair>	 hi
[22:58:57] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] add mediawiki packages and other dependencies for snapshots [puppet] - https://gerrit.wikimedia.org/r/281552 (owner: ArielGlenn)
[22:59:52] <grrrit-wm>	 (PS3) Faidon Liambotis: resolving::domain_search: drop esams.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/280503 (https://phabricator.wikimedia.org/T123712) (owner: Dzahn)
[23:00:04] <Luke081515>	 Dereckson wants to deploy the patch for his homewiki? ;)
[23:00:04] <jouncebot>	 RoanKattouw ostriches Krenair MaxSem Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160404T2300). Please do the needful.
[23:00:04] <jouncebot>	 Luke081515: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:09] <Dereckson>	 How can we do that? Do you know a quick software to share a screen? Or I can also create a shared@ account on my server, from there we can share a tmux, and I su under my account.
[23:00:13] * Luke081515 is already here
[23:00:17] <grrrit-wm>	 (PS4) Faidon Liambotis: resolving::domain_search: drop esams.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/280503 (https://phabricator.wikimedia.org/T123712) (owner: Dzahn)
[23:00:21] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[23:00:41] <Krenair>	 Dereckson, you haven't deployed by yourself yet?
[23:00:59] <grrrit-wm>	 (CR) Faidon Liambotis: [C: 2] resolving::domain_search: drop esams.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/280503 (https://phabricator.wikimedia.org/T123712) (owner: Dzahn)
[23:01:05] <Dereckson>	 Er yes, one config patch Saturday morning.
[23:01:15] <Dereckson>	 and the logo for cs.
[23:01:30] <Krenair>	 what do you mean saturday morning?
[23:01:49] <Krenair>	 stuff happened on saturday? huh, ok
[23:01:56] <Dereckson>	 08:29 logmsgbot: dereckson@tin Synchronized wmf-config/throttle.php: Fix throttle rules (Gerrit change 280819). (duration: 00m 29s)
[23:02:16] <Dereckson>	 the usual ip / IP typo in throttle.php to fix.
[23:02:34] <MaxSem>	 srsly, dude
[23:02:38] <Luke081515>	 In theory I got three changes, so everyone can deploy :D
[23:02:46] <MaxSem>	 on Saturday, your first deploy
[23:03:33] <Krenair>	 Dereckson, anyway, do you have any questions about deploying?
[23:03:45] <greg-g>	 wait, Dereckson why did you deploy on Saturday?
[23:04:13] <icinga-wm>	 PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: puppet fail
[23:04:33] <Dereckson>	 greg-g: reedy added to the hackathon a throttle rule
[23:04:55] <Dereckson>	 he copied/pasted a block I wrote for another rule
[23:05:06] <Dereckson>	 both blocks used "ip" in lowercase instead of "IP" in uppercase
[23:05:07] <MaxSem>	 do you even have access to the contact list in case you broke prod?
[23:05:10] <Krenair>	 which turns out to have been broken
[23:05:19] <Krenair>	 MaxSem, you mean the staff contact list?
[23:05:26] <MaxSem>	 yep
[23:05:37] <Krenair>	 Dereckson doesn't have an account on that wiki if that's what you're asking
[23:05:47] <MaxSem>	 like, people who can fix cluster if it goes boom
[23:06:04] <Dereckson>	 Reedy was on the channel at that moment with me.
[23:06:54] <Luke081515>	 maybe you can discuss that after the SWAT? The swat actually blocks my work on my bot
[23:07:05] <greg-g>	 I'm a little sad that your first deploy happened on a saturday, even with Reedy's prodding
[23:08:22] <grrrit-wm>	 (PS2) Faidon Liambotis: network: add $production_networks [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396)
[23:08:30] <greg-g>	 ok, so, I'll talk with Dereckson and Reedy tomorrow or so about this, for now just get the swat done
[23:08:39] <Luke081515>	 ok
[23:08:41] <Krenair>	 Dereckson is there a problem with this swat?
[23:08:49] <MaxSem>	 I'll do it meanwhile
[23:08:57] <greg-g>	 Dereckson: for the record, please only deploy during swat windows
[23:09:01] <Dereckson>	 No, it seems three changes to merge with only sync-file.
[23:09:04] <Dereckson>	 greg-g: okay
[23:09:42] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0]
[23:10:57] <grrrit-wm>	 (CR) MaxSem: [C: 2] Two permission changes at cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281314 (https://phabricator.wikimedia.org/T131684) (owner: Luke081515)
[23:11:12] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0]
[23:11:33] <grrrit-wm>	 (Merged) jenkins-bot: Two permission changes at cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281314 (https://phabricator.wikimedia.org/T131684) (owner: Luke081515)
[23:13:45] <logmsgbot>	 !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/281314/ (duration: 00m 28s)
[23:13:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:13:51] <MaxSem>	 Luke081515, ^
[23:13:55] <grrrit-wm>	 (PS1) ArielGlenn: fix one more stray directory reference in snapshot misc cron job [puppet] - https://gerrit.wikimedia.org/r/281560
[23:14:01] <icinga-wm>	 PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:14:28] <Luke081515>	 change broke nothing, but one problem with this change: https://cs.wikipedia.org/wiki/Speci%C3%A1ln%C3%AD:Seznam_u%C5%BEivatelsk%C3%BDch_pr%C3%A1v
[23:14:39] <Luke081515>	 created one additional group instead of adding one permission
[23:14:41] <Dereckson>	 greg-g: MaxSem: by the way, erratum, it was Friday morning, the first day of the hackathon
[23:14:44] <Luke081515>	 I will write a fix
[23:15:31] <grrrit-wm>	 (CR) MaxSem: [C: 2] Update project logo for ast.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/280445 (https://phabricator.wikimedia.org/T131247) (owner: MarcoAurelio)
[23:16:00] <grrrit-wm>	 (PS1) Faidon Liambotis: otrs: remove monitoring::service['https'] [puppet] - https://gerrit.wikimedia.org/r/281561
[23:16:19] <grrrit-wm>	 (Merged) jenkins-bot: Update project logo for ast.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/280445 (https://phabricator.wikimedia.org/T131247) (owner: MarcoAurelio)
[23:16:29] <grrrit-wm>	 (CR) Faidon Liambotis: [C: 2 V: 2] otrs: remove monitoring::service['https'] [puppet] - https://gerrit.wikimedia.org/r/281561 (owner: Faidon Liambotis)
[23:16:31] <icinga-wm>	 RECOVERY - puppet last run on mw1104 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[23:16:32] <Luke081515>	 Dereckson: Do you have the link where the name of the course coordinator is listed again? I need to fix the name
[23:17:29] <logmsgbot>	 !log maxsem@tin Synchronized static/images/project-logos/astwiki.png: https://gerrit.wikimedia.org/r/#/c/280445/ (duration: 00m 27s)
[23:17:33] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:17:52] <grrrit-wm>	 (PS2) Faidon Liambotis: otrs: remove monitoring::service['https'] [puppet] - https://gerrit.wikimedia.org/r/281561
[23:17:52] <Luke081515>	 ok, this patch works
[23:18:18] <logmsgbot>	 !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/280445/ (duration: 00m 27s)
[23:18:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:18:36] <grrrit-wm>	 (CR) Faidon Liambotis: [V: 2] otrs: remove monitoring::service['https'] [puppet] - https://gerrit.wikimedia.org/r/281561 (owner: Faidon Liambotis)
[23:19:10] <MaxSem>	 now about the frwiki patch... do we know how to test it?
[23:19:23] <Dereckson>	 Luke081515: https://www.mediawiki.org/wiki/Extension:Education_Program/Preferences#Course_coordinators
[23:19:26] <Luke081515>	 we can take a look if the protection level is there
[23:19:36] <Luke081515>	 and I'm in #wikipedia-fr, can ask a sysop there
[23:19:40] <Luke081515>	 thx, dereckson
[23:20:44] <Luke081515>	 Dereckson: There is something wrong at that mw.org page: My patch has the exact user group name as there
[23:20:57] <wikibugs>	 Operations, Labs, Tool-Labs, Icinga: tool labs instance distribution monitoring is broken - https://phabricator.wikimedia.org/T119929#1840320 (faidon) Ping? If this can't be fixed anytime soon, can we remove the check from the servers on puppet at least? (I've been auditing acknowledged-but-forgotte...
[23:20:58] <Luke081515>	 but the patch created an additional group
[23:21:52] <Luke081515>	 found the right name at the source code
[23:22:11] <MaxSem>	 ok...
[23:22:38] <Luke081515>	 mw.org has ep.coordiantor, while the program uses epcoordinator
[23:22:44] <Luke081515>	 *ep-coordiantor
[23:22:54] <grrrit-wm>	 (CR) MaxSem: [C: 2] Add 'editextendedsemiprotected' protection level on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281070 (https://phabricator.wikimedia.org/T131109) (owner: Luke081515)
[23:24:10] <grrrit-wm>	 (Merged) jenkins-bot: Add 'editextendedsemiprotected' protection level on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281070 (https://phabricator.wikimedia.org/T131109) (owner: Luke081515)
[23:24:22] <icinga-wm>	 PROBLEM - HTTPS on mendelevium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: SSL connect attempt failed with unknown error error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol
[23:25:12] <logmsgbot>	 !log maxsem@tin Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/281070/ (duration: 00m 32s)
[23:25:16] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:25:56] <paravoid>	 (ignore the mendelevium alert, see my patchset above)
[23:26:00] <paravoid>	 (not that anyone is looking :P)
[23:26:26] <Luke081515>	 ok, all looks like expected, I ask at the frwp channel to try it out
[23:26:59] <grrrit-wm>	 (PS1) Luke081515: fix epcoordinator name at cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281565 (https://phabricator.wikimedia.org/T131684)
[23:27:14] <Luke081515>	 maxsem: ^
[23:28:26] <grrrit-wm>	 (CR) MaxSem: [C: 2] fix epcoordinator name at cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281565 (https://phabricator.wikimedia.org/T131684) (owner: Luke081515)
[23:28:52] <grrrit-wm>	 (Merged) jenkins-bot: fix epcoordinator name at cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281565 (https://phabricator.wikimedia.org/T131684) (owner: Luke081515)
[23:29:45] <Luke081515>	 maxsem, dereckson: Ok, patch for frwp works: https://fr.wikipedia.org/w/index.php?title=Utilisateur%3AAsh_Crow%2FBrouillon%2FFirefly&type=revision&diff=125012838&oldid=105231109
[23:30:06] <logmsgbot>	 !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/281565/ (duration: 00m 29s)
[23:30:11] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:30:13] <icinga-wm>	 PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: puppet fail
[23:30:38] <Luke081515>	 maxsem: Got it, works now as expected
[23:30:41] <Luke081515>	 thanks for SWAT
[23:30:44] <MaxSem>	 awsum
[23:31:26] <paravoid>	 6 out of 19 alerts are yours now :P
[23:33:08] <Dereckson>	 MaxSem: so where can I get a copy of this contact list?
[23:34:03] <grrrit-wm>	 (PS1) ArielGlenn: remove nutcracker for now, broken manifest on jessie [puppet] - https://gerrit.wikimedia.org/r/281569
[23:34:18] <Krenair>	 Dereckson, you can't without diving into officewiki from the server side
[23:34:30] <icinga-wm>	 PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: puppet fail
[23:35:14] <grrrit-wm>	 (PS2) ArielGlenn: remove nutcracker for now, broken manifest on jessie [puppet] - https://gerrit.wikimedia.org/r/281569
[23:37:05] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] remove nutcracker for now, broken manifest on jessie [puppet] - https://gerrit.wikimedia.org/r/281569 (owner: ArielGlenn)
[23:40:28] <Luke081515>	 Dereckson: I fixed the wrong information at mw.org about the education program group names
[23:40:31] <icinga-wm>	 RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:44:02] <Dereckson>	 Luke081515: perfect
[23:54:55] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on snapshot1005 is CRITICAL: CRITICAL: puppet fail arielglenn packages missing on jessie (of course).
[23:54:55] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on snapshot1006 is CRITICAL: CRITICAL: puppet fail arielglenn packages missing on jessie (of course).
[23:54:55] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on snapshot1007 is CRITICAL: CRITICAL: puppet fail arielglenn packages missing on jessie (of course).
[23:56:02] <icinga-wm>	 PROBLEM - Apache HTTP on mw1240 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.009 second response time
[23:57:51] <icinga-wm>	 RECOVERY - Apache HTTP on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.051 second response time
[23:58:31] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]