[00:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161111T0000). Please do the needful.
[00:00:05] <jouncebot>	 RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:44] <grrrit-wm>	 (03CR) 10Reedy: "Pretty much. Want to test the script manually on beta first to check the internals still work... Not sure they've been run for a while!" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy)
[00:01:29] <RoanKattouw>	 My patch is the only one, so I'll do the SWAT
[00:01:45] <grrrit-wm>	 (03PS2) 10Catrope: Enable {{NOINDEX}} as a noindex template on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319348 (https://phabricator.wikimedia.org/T149538) 
[00:02:13] <grrrit-wm>	 (03CR) 10Catrope: [C: 032] Enable {{NOINDEX}} as a noindex template on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319348 (https://phabricator.wikimedia.org/T149538) (owner: 10Catrope)
[00:02:45] <grrrit-wm>	 (03Merged) 10jenkins-bot: Enable {{NOINDEX}} as a noindex template on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319348 (https://phabricator.wikimedia.org/T149538) (owner: 10Catrope)
[00:03:27] <grrrit-wm>	 (03CR) 10Legoktm: "PageTriage has switched to extension.json, so there's no need for $wg = $wmg anymore." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319348 (https://phabricator.wikimedia.org/T149538) (owner: 10Catrope)
[00:05:14] <icinga-wm>	 RECOVERY - Disk space on elastic1024 is OK: DISK OK
[00:07:46] <mutante>	 ^ checked with discovery, that was a reindex, it needs a lot more disk but only temp
[00:09:04] <kaldari>	 RoanKattouw: You should be able to test it with the article https://en.wikipedia.org/wiki/Youssif_Isa
[00:10:03] <RoanKattouw>	 Thanks man
[00:10:09] <kaldari>	 RoanKattouw: It currently doesn't have a noindex tag, but it should after the change.
[00:10:54] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:10:58] <grrrit-wm>	 (03CR) 10Catrope: "Good point. There were already two wmg's there, so I assumed I needed one too. I'll clean them all up in one go afterwards." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319348 (https://phabricator.wikimedia.org/T149538) (owner: 10Catrope)
[00:12:30] <grrrit-wm>	 (03PS2) 10Kaldari: Removing registered trademark symbol from footer of Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320865 
[00:12:36] <RoanKattouw>	 yay, it works on mw1099
[00:13:24] <icinga-wm>	 PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:13:47] <RoanKattouw>	 kaldari: You want that trademark one to ride along too?
[00:13:50] <logmsgbot>	 !log catrope@tin Synchronized wmf-config/CommonSettings.php: Enable {{NOINDEX}} as a noindex template on enwiki (1/2) (T149538) (duration: 00m 49s)
[00:13:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:57] <stashbot>	 T149538: Noindex template feature should be restricted to new articles - https://phabricator.wikimedia.org/T149538
[00:14:04] <kaldari>	 RoanKattouw: Oh, sure
[00:14:36] <grrrit-wm>	 (03PS3) 10Kaldari: Removing registered trademark symbol from footer of Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320865 (https://phabricator.wikimedia.org/T95007) 
[00:15:04] <grrrit-wm>	 (03CR) 10Catrope: [C: 032] Removing registered trademark symbol from footer of Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320865 (https://phabricator.wikimedia.org/T95007) (owner: 10Kaldari)
[00:15:34] <grrrit-wm>	 (03Merged) 10jenkins-bot: Removing registered trademark symbol from footer of Office Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320865 (https://phabricator.wikimedia.org/T95007) (owner: 10Kaldari)
[00:15:39] <logmsgbot>	 !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable {{NOINDEX}} as a noindex template on enwiki (2/2) (T149538) (duration: 00m 47s)
[00:15:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:22] <logmsgbot>	 !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Remove registered trademark symbol from officewiki footer (T95007) (duration: 00m 48s)
[00:18:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:28] <stashbot>	 T95007: Improve trademark code in MobileFrontend - https://phabricator.wikimedia.org/T95007
[00:23:07] <godog>	 !log swift eqiad-prod: ms-be1027 to weight 1000 - T136631
[00:23:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:14] <stashbot>	 T136631: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631
[00:24:23] <wikibugs>	 06Operations, 10ops-eqiad, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2341913 (10fgiunchedi)
[00:24:25] <wikibugs>	 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2787963 (10fgiunchedi) 05Open>03Resolved thanks @Cmjohnson for taking care of this! LGTM now, will progressively put the machine in service in {T136631}
[00:24:43] <wikibugs>	 06Operations, 10ops-eqiad, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2341913 (10fgiunchedi) a:05Cmjohnson>03fgiunchedi
[00:24:54] <icinga-wm>	 PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:32:25] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ms-be1027 is CRITICAL: CRITICAL: Active: 5, Working: 5, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T150498
[00:32:28] <wikibugs>	 06Operations, 10ops-eqiad: Degraded RAID on ms-be1027 - https://phabricator.wikimedia.org/T150498#2787976 (10ops-monitoring-bot)
[00:36:39] <godog>	 wah wah waaaaahhh
[00:37:03] <mutante>	 :)
[00:37:16] <mutante>	 you mean the auto-ack, right
[00:39:41] <godog>	 no the fact that the host failed _again_
[00:42:24] <icinga-wm>	 RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[00:45:44] <wikibugs>	 06Operations, 10ops-eqiad, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2788018 (10fgiunchedi)
[00:45:47] <wikibugs>	 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2788015 (10fgiunchedi) 05Resolved>03Open I spoke way too soon, machine still reports failures on SSDs as in P4409 :(  Looks like to me it might be just DOA?
[00:48:25] <mutante>	 godog: owww...ok
[00:49:35] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be1027 is CRITICAL: CRITICAL: Slot 3: Failed: 2I:4:1, 2I:4:2 - OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T150500
[00:49:38] <wikibugs>	 06Operations, 10ops-eqiad: Degraded RAID on ms-be1027 - https://phabricator.wikimedia.org/T150500#2788019 (10ops-monitoring-bot)
[00:53:22] <wikibugs>	 06Operations, 13Patch-For-Review: Firewall sets not being loaded post-reboot due to a @resolve race - https://phabricator.wikimedia.org/T148986#2738609 (10greg) Obligatory UBN! priority check-in after 2.5 weeks. Is that prio still valid? Should this be prioritized within some team more highly? There's a relate...
[00:53:36] <cmjohnson>	 godog: That is awesome news!!! Happy that is over
[00:53:54] <icinga-wm>	 RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[00:54:28] <mutante>	 so, is it over or is it a second fail?
[00:55:03] <godog>	 the latter
[00:55:09] <godog>	 cmjohnson: not over :(((((
[00:55:30] <cmjohnson>	  ....
[00:56:16] <cmjohnson>	 that server is going to be the death of me
[00:58:57] <wikibugs>	 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2788085 (10Liuxinyu970226)
[00:59:42] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Icinga should alert on free disk space < 15% on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329#2788089 (10Dzahn) 05Resolved>03Open < ebernhardson> mutante: thanks for the ping, but in general you don't have to worry about...
[01:01:05] <godog>	 seriously
[01:01:54] <mutante>	 maybe not worth it.. hardware donation to other non-profit ?
[01:03:37] <godog>	 nah it is under warranty heh
[01:04:02] <cmjohnson>	 i've replaced just about everything...guess now I need another disk
[01:04:30] <godog>	 !log revert swift ring change for ms-be1027
[01:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:11] <grrrit-wm>	 (03PS1) 10BBlack: Test write buffer size theory for extra RTT [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320939 
[01:05:11] <godog>	 cmjohnson: sigh, including replacing the controller?
[01:05:13] <grrrit-wm>	 (03PS1) 10BBlack: nginx (1.11.4-1+wmf15) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320940 
[01:05:27] <cmjohnson>	 I hate HP
[01:05:34] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[01:06:34] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[01:09:10] <godog>	 cmjohnson: I have to run, the machine is now in icinga though so it'll alarm if you take it down, it is otherwise in your hands
[01:09:42] <cmjohnson>	 okay...thx for letting me know. I will take a hammer to it in the morning! ;-)
[01:11:00] <grrrit-wm>	 (03PS1) 10Dduvall: [WIP] contint: New role for Docker based CI slave [puppet] - 10https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) 
[01:13:51] <grrrit-wm>	 (03PS1) 10Dzahn: mgmt: fix typos in getmgmtips script [puppet] - 10https://gerrit.wikimedia.org/r/320943 
[01:16:13] <grrrit-wm>	 (03PS2) 10Dzahn: mgmt: fix typos in getmgmtips script [puppet] - 10https://gerrit.wikimedia.org/r/320943 
[01:16:34] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] mgmt: fix typos in getmgmtips script [puppet] - 10https://gerrit.wikimedia.org/r/320943 (owner: 10Dzahn)
[01:17:04] <icinga-wm>	 PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:25:06] <icinga-wm>	 PROBLEM - Disk space on ms-be1027 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb3 is not accessible: Input/output error
[01:25:06] <icinga-wm>	 PROBLEM - swift-container-replicator on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[01:25:07] <icinga-wm>	 PROBLEM - swift-account-auditor on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[01:25:07] <icinga-wm>	 PROBLEM - swift-object-server on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[01:25:14] <icinga-wm>	 PROBLEM - swift-account-reaper on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[01:25:24] <icinga-wm>	 PROBLEM - swift-object-updater on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[01:25:24] <icinga-wm>	 PROBLEM - swift-container-server on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[01:25:24] <icinga-wm>	 PROBLEM - swift-account-replicator on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[01:25:24] <icinga-wm>	 PROBLEM - swift-container-updater on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[01:25:34] <icinga-wm>	 PROBLEM - swift-account-server on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[01:25:44] <icinga-wm>	 PROBLEM - swift-object-auditor on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[01:25:44] <icinga-wm>	 PROBLEM - swift-container-auditor on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[01:25:54] <icinga-wm>	 PROBLEM - swift-object-replicator on ms-be1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[01:29:20] <madhuvishy>	 ummm godog are you working on this ^
[01:30:48] <grrrit-wm>	 (03PS2) 10BBlack: Test another write buffer size theory for extra RTT [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320939 
[01:30:50] <grrrit-wm>	 (03PS2) 10BBlack: nginx (1.11.4-1+wmf15) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320940 
[01:32:37] <grrrit-wm>	 (03PS1) 10Dzahn: mgmt: add success/fail logs to changepw [puppet] - 10https://gerrit.wikimedia.org/r/320945 
[01:37:05] <mutante>	 incoming ... really quick gerrit restart for config change 
[01:37:16] <grrrit-wm>	 (03PS6) 10Dzahn: Gerrit: Up the size for packedGitLimit to 2gb [puppet] - 10https://gerrit.wikimedia.org/r/317322 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox)
[01:38:51] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] Gerrit: Up the size for packedGitLimit to 2gb [puppet] - 10https://gerrit.wikimedia.org/r/317322 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox)
[01:39:51] <mutante>	 !log gerrit restarting for config change 317322 (T148478)
[01:39:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:59] <stashbot>	 T148478: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478
[01:40:38] <mutante>	 grrrit-wm: restart
[01:40:40] <grrrit-wm>	 re-connecting to gerrit
[01:40:41] <grrrit-wm>	 reconnected to gerrit
[01:40:44] <mutante>	 sweet
[01:40:47] <mutante>	 and done
[01:41:15] <mutante>	 hopefully that will help with performance of gerrit now
[01:43:04] <grrrit-wm>	 (03PS1) 10Madhuvishy: labstore: Check that NFS is being served over Cluster IP for secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/320946 (https://phabricator.wikimedia.org/T144633) 
[01:45:04] <icinga-wm>	 RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[01:45:06] <mutante>	 !log gerrit now has higher "packedGitLimit" of 2g, goal is to reduce Gerrit slowdowns 
[01:45:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:52] <grrrit-wm>	 (03CR) 10Dzahn: "done. gerrit restarted." [puppet] - 10https://gerrit.wikimedia.org/r/317322 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox)
[01:49:35] <grrrit-wm>	 (03CR) 10Madhuvishy: [C: 032] labstore: Check that NFS is being served over Cluster IP for secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/320946 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy)
[01:49:53] <grrrit-wm>	 (03CR) 10Madhuvishy: labstore: Check that NFS is being served over Cluster IP for secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/320946 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy)
[01:50:33] <grrrit-wm>	 (03PS2) 10Madhuvishy: labstore: Check that NFS is being served over Cluster IP for secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/320946 (https://phabricator.wikimedia.org/T144633) 
[01:51:06] <godog>	 madhuvishy: yeah that was me, renewed the downtime, thanks !
[01:51:19] <madhuvishy>	 godog: okay cool :)
[01:52:03] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2788190 (10Dzahn) We have now increased the packedGitLimit setting to 2g.  Like @20after4 originally said on [1]  "2...
[01:52:12] <grrrit-wm>	 (03CR) 10Madhuvishy: [C: 032] labstore: Check that NFS is being served over Cluster IP for secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/320946 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy)
[01:57:24] <grrrit-wm>	 (03PS1) 10Madhuvishy: labstore: Rename secondary cluster monitoring descriptions [puppet] - 10https://gerrit.wikimedia.org/r/320949 
[01:58:50] <wikibugs>	 06Operations, 10ops-eqiad: Degraded RAID on ms-be1027 - https://phabricator.wikimedia.org/T150500#2788197 (10fgiunchedi) 05Open>03Invalid See also T140374
[01:58:51] <grrrit-wm>	 (03CR) 10Madhuvishy: [C: 032] labstore: Rename secondary cluster monitoring descriptions [puppet] - 10https://gerrit.wikimedia.org/r/320949 (owner: 10Madhuvishy)
[01:59:28] <grrrit-wm>	 (03PS1) 10BBlack: test commit, 8k default buffer [debs/openssl11] - 10https://gerrit.wikimedia.org/r/320950 
[01:59:31] <grrrit-wm>	 (03PS1) 10BBlack: openssl (1.1.0c-1+wmf2) jessie-wikimedia; urgency=medium [debs/openssl11] - 10https://gerrit.wikimedia.org/r/320951 
[02:00:38] <wikibugs>	 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2788204 (10GWicke) The main benefit of encoding the original dimensions in the URL would be consistency across formats, and some amount of ease of use....
[02:03:40] <grrrit-wm>	 (03CR) 10Dzahn: "hmm .. http://puppet-compiler.wmflabs.org/4584/" [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn)
[02:08:16] <grrrit-wm>	 (03CR) 10Dzahn: [C: 04-1] "I moved it to the ipmi module, but this doesn't install it globally as intended, this just installs it on puppetmaster, bast4001 and saltm" [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn)
[02:18:08] <grrrit-wm>	 (03PS5) 10Dzahn: base/ipmi: install freeipmi globally, move to ipmi module [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) 
[02:20:32] <grrrit-wm>	 (03CR) 10Dzahn: "So it would have to be like PS5 then to work. Add a second class in module ipmi that just installs the packages and include that in base. " [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn)
[02:23:11] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.2) (duration: 04m 56s)
[02:23:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:58] <grrrit-wm>	 (03CR) 10Dzahn: "the compiler says there would be no change but that's not true, bug T149432. if you look at the actual catalog the freeipmi packages get i" [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn)
[02:28:24] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Nov 11 02:28:24 UTC 2016 (duration 5m 14s)
[02:28:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:20] <grrrit-wm>	 (03CR) 10Papaul: [C: 032] mgmt: add success/fail logs to changepw [puppet] - 10https://gerrit.wikimedia.org/r/320945 (owner: 10Dzahn)
[02:46:52] <grrrit-wm>	 (03CR) 10Papaul: "Tested and works" [puppet] - 10https://gerrit.wikimedia.org/r/320945 (owner: 10Dzahn)
[02:50:59] <grrrit-wm>	 (03PS2) 10Papaul: mgmt: add success/fail logs to changepw [puppet] - 10https://gerrit.wikimedia.org/r/320945 (owner: 10Dzahn)
[03:05:00] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1811.361844 Seconds
[03:06:00] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[03:06:40] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[03:07:40] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[03:18:40] <icinga-wm>	 PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:20:00] <icinga-wm>	 PROBLEM - puppet last run on mc1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:25:23] <grrrit-wm>	 (03PS1) 10Dzahn: fix mgmt names in wrong data center [dns] - 10https://gerrit.wikimedia.org/r/320954 
[03:29:41] <grrrit-wm>	 (03PS2) 10Dzahn: fix mgmt names in wrong data center [dns] - 10https://gerrit.wikimedia.org/r/320954 
[03:35:35] <grrrit-wm>	 (03PS3) 10Dzahn: fix mgmt names in wrong data center [dns] - 10https://gerrit.wikimedia.org/r/320954 
[03:39:21] <grrrit-wm>	 (03PS1) 10Dzahn: consistent capitalization of mgmt asset tag names [dns] - 10https://gerrit.wikimedia.org/r/320959 
[03:47:17] <grrrit-wm>	 (03PS2) 10Dzahn: consistent capitalization of mgmt asset tag names [dns] - 10https://gerrit.wikimedia.org/r/320959 
[03:47:41] <icinga-wm>	 RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[03:48:00] <icinga-wm>	 RECOVERY - puppet last run on mc1036 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[03:53:25] <grrrit-wm>	 (03PS4) 10Dzahn: fix mgmt names in wrong data center [dns] - 10https://gerrit.wikimedia.org/r/320954 (https://phabricator.wikimedia.org/T149875) 
[04:00:53] <grrrit-wm>	 (03CR) 10Dzahn: "Host wmf3138.mgmt.eqiad.wmnet. not found: 3(NXDOMAIN)" [dns] - 10https://gerrit.wikimedia.org/r/320954 (https://phabricator.wikimedia.org/T149875) (owner: 10Dzahn)
[04:06:40] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[04:09:40] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[05:25:08] <wikibugs>	 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2788286 (10Naveenpf) @CRoslof This is an enhancement request. If someone take wikipedia.in now it is redirecting to new URL. There is no point in...
[05:42:40] <icinga-wm>	 PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:05:40] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[06:07:40] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[06:10:40] <icinga-wm>	 RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:27:30] <icinga-wm>	 PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:44:30] <icinga-wm>	 PROBLEM - MD RAID on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:47:30] <icinga-wm>	 RECOVERY - MD RAID on thumbor1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[06:54:28] <grrrit-wm>	 (03PS1) 10Madhuvishy: labstore: Set mailto address for secondary backups cron [puppet] - 10https://gerrit.wikimedia.org/r/320962 (https://phabricator.wikimedia.org/T144633) 
[06:55:50] <icinga-wm>	 PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:56:30] <icinga-wm>	 PROBLEM - MD RAID on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:56:30] <icinga-wm>	 RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:57:20] <icinga-wm>	 RECOVERY - MD RAID on thumbor1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[07:17:06] <wikibugs>	 06Operations, 10ops-codfw, 10DBA: db2035: RAID disk about to fail - https://phabricator.wikimedia.org/T150511#2788294 (10Marostegui)
[07:24:50] <icinga-wm>	 RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[07:26:08] <grrrit-wm>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1068 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320963 
[07:27:27] <grrrit-wm>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1068 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320963 
[07:30:22] <grrrit-wm>	 (03PS2) 10Madhuvishy: labstore: Set mailto address for secondary backups cron [puppet] - 10https://gerrit.wikimedia.org/r/320962 (https://phabricator.wikimedia.org/T144633) 
[07:30:55] <grrrit-wm>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1068 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320963 (owner: 10Marostegui)
[07:31:26] <grrrit-wm>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1068 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320963 (owner: 10Marostegui)
[07:31:32] <grrrit-wm>	 (03CR) 10Madhuvishy: [C: 032 V: 032] labstore: Set mailto address for secondary backups cron [puppet] - 10https://gerrit.wikimedia.org/r/320962 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy)
[07:33:08] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1068 - T149079 (duration: 00m 48s)
[07:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:17] <stashbot>	 T149079: codfw: Fix S4 commonswiki.templatelinks partitions - https://phabricator.wikimedia.org/T149079
[07:33:58] <wikibugs>	 06Operations: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788316 (10Peachey88) p:05Triage>03Unbreak!
[07:34:10] <p858snake|L2>	 someone want to look at https://phabricator.wikimedia.org/T150503 please?
[07:34:24] <p858snake|L2>	 legoktm: if you are still around ^
[07:38:34] <grrrit-wm>	 (03CR) 10Marostegui: mariadb-labs: Prepare db1095 to be the new sanitarium host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/320752 (https://phabricator.wikimedia.org/T149829) (owner: 10Jcrespo)
[07:52:47] <wikibugs>	 06Operations, 13Patch-For-Review: Firewall sets not being loaded post-reboot due to a @resolve race - https://phabricator.wikimedia.org/T148986#2788321 (10MoritzMuehlenhoff) We already have monitoring for this (implicitly via the connection tracking Icinga check), but more explicit monitoring is under way via...
[08:27:43] <legoktm>	 p858snake|L2: can you reproduce it?
[08:28:59] <wikibugs>	 06Operations: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788064 (10Legoktm) Creating <https://en.wikiversity.org/w/index.php?title=User:Legoktm/sandbox&action=history> worked for me. Is this happening for anyone besides yourself?
[08:44:59] <wikibugs>	 06Operations, 10MediaWiki-General-or-Unknown: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788354 (10Joe)
[08:45:38] <_joe_>	 are we sure it's an UBN! ticket?
[08:46:42] <_joe_>	 it seems like a thing that's important but not something we should work on non-stop with maximum priority
[08:49:06] <grrrit-wm>	 (03CR) 10Volans: [C: 04-1] "See inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn)
[08:53:33] <moritzm>	 _joe_: it only affects a few hosts and was quickly spotted via the failing conntrack check, but we can just as well keep the prio, I'm working on the dedicated Icinga check later in the day
[08:53:57] <_joe_>	 moritzm: what are you referring to?
[08:54:08] <_joe_>	 I was referring to T150503
[08:54:08] <stashbot>	 T150503: Failure to save recent changes - https://phabricator.wikimedia.org/T150503
[08:54:46] <moritzm>	 oh, sorry, I thought you were referring to T148986, which I commented a few lines above
[08:54:46] <stashbot>	 T148986: Firewall sets not being loaded post-reboot due to a @resolve race - https://phabricator.wikimedia.org/T148986
[08:56:41] <p858snake|L2>	 _joe_: tbh, if people can't save edits, yes it's UBN
[08:56:47] <p858snake|L2>	 legoktm: bit busy to check now
[08:57:05] <_joe_>	 p858snake|L2: I agree, but it's a single report AFAICS
[08:57:42] <_joe_>	 from a few hours ago, if there are more, I agree with you
[08:58:08] <_joe_>	 if not, it can be treated within the flow of "high" priority tickets, IMHO
[08:58:48] <_joe_>	 that's why  I asked for opinions :)
[09:02:40] <icinga-wm>	 PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:21:21] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Configure connection tracking sysctl settings in ferm [puppet] - 10https://gerrit.wikimedia.org/r/320590 (https://phabricator.wikimedia.org/T136094) 
[09:21:32] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 04-2] Configure connection tracking sysctl settings in ferm [puppet] - 10https://gerrit.wikimedia.org/r/320590 (https://phabricator.wikimedia.org/T136094) (owner: 10Muehlenhoff)
[09:26:17] <grrrit-wm>	 (03CR) 10Muehlenhoff: "I tested the approach of setting the sysctl settings in a ferm configuration sub file in https://gerrit.wikimedia.org/r/#/c/320590/, but t" [puppet] - 10https://gerrit.wikimedia.org/r/319071 (owner: 10Muehlenhoff)
[09:30:40] <icinga-wm>	 RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[09:33:29] <grrrit-wm>	 (03PS3) 10Elukey: Disable connection tracking for kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/320758 (owner: 10Muehlenhoff)
[09:37:51] <grrrit-wm>	 (03CR) 10Elukey: [C: 032] Disable connection tracking for kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/320758 (owner: 10Muehlenhoff)
[09:38:50] <elukey>	 disabled puppet on kafka analytics, will run puppet only on one broker first for --^
[09:38:57] <marostegui>	 !log Deploy schema change s4 commonswiki.revision db1069 - T147305
[09:39:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:07] <stashbot>	 T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305
[09:50:26] <wikibugs>	 Operations, ops-codfw, DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2788425 (Marostegui) The data copy finished and after running mysql_upgrade I have started replication and the slaves are catching up nicely with the master.  I forgot to include the RAID config...
[09:51:21] <wikibugs>	 Operations, ops-codfw, DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2788426 (Marostegui) @Papaul the disks still need to be wiped, is that something you can do or something we have to do?  I will leave this ticket open until you let us know.  Thanks
[09:54:17] <grrrit-wm>	 (PS6) Muehlenhoff: Create a separate sysctl configuration for setting conntrack settings [puppet] - https://gerrit.wikimedia.org/r/319071
[10:03:31] <grrrit-wm>	 (PS7) Muehlenhoff: Create a separate sysctl configuration for setting conntrack settings [puppet] - https://gerrit.wikimedia.org/r/319071
[10:04:34] <grrrit-wm>	 (PS3) Muehlenhoff: Load connection tracking sysctl values via a separate systemd unit [puppet] - https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094)
[10:05:04] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Create a separate sysctl configuration for setting conntrack settings [puppet] - https://gerrit.wikimedia.org/r/319071 (owner: Muehlenhoff)
[10:05:19] <elukey>	 !log increasing apache log level on mw1284 (depooling, applying config manually, re-pooling with lower weight) for a 503 investigation 
[10:05:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:49] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Load connection tracking sysctl values via a separate systemd unit [puppet] - https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) (owner: Muehlenhoff)
[10:05:50] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[10:06:50] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[10:14:55] <marostegui>	 !log Deploy alter table dbstore1002 s4 commonswiki.revision - T147305
[10:15:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:02] <stashbot>	 T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305
[10:19:24] <grrrit-wm>	 (PS8) Muehlenhoff: Create a separate sysctl configuration for setting conntrack settings [puppet] - https://gerrit.wikimedia.org/r/319071
[10:20:19] <grrrit-wm>	 (CR) Alexandros Kosiaris: "Answered all inline comments, @volans, I also did some basic state mapping as you suggested." (7 comments) [puppet] - https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890) (owner: Alexandros Kosiaris)
[10:20:46] <grrrit-wm>	 (PS2) Alexandros Kosiaris: Introduce a system wide systemd check [puppet] - https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890)
[10:22:15] <grrrit-wm>	 (PS4) Muehlenhoff: Load connection tracking sysctl values via a separate systemd unit [puppet] - https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094)
[10:32:06] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2788464 (ArielGlenn) This setting change means that we'll have more things in memory and that (logically) GC pause...
[10:35:10] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 17 failures. Last run 2 minutes ago with 17 failures. Failed resources (up to 3 shown): Service[ferm],Service[diamond],Service[prometheus-node-exporter],Package[ecryptfs-utils]
[10:38:00] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788466 (ema)
[10:40:22] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788473 (ema) Log captured with `varnishlog -n frontend -g request -q 'RespStatus eq 503'`  ``` *   << Request  >> 629660955  -   Begin          req 629660954 rxreq -   Timest...
[10:41:56] <wikibugs>	 Operations, Traffic, Prometheus-metrics-monitoring: Error collecting metrics from varnish_exporter on some misc hosts - https://phabricator.wikimedia.org/T150479#2788475 (ema) p:Triage>Normal
[10:42:59] <wikibugs>	 Operations, Traffic: 503 errors for users connecting to esams - https://phabricator.wikimedia.org/T149865#2788477 (ema) Open>Resolved
[10:51:50] <elukey>	 !log restored mw1284 to its normal settings
[10:51:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:01] <grrrit-wm>	 (PS5) Muehlenhoff: Load connection tracking sysctl values via a separate systemd unit [puppet] - https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094)
[10:54:17] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Load connection tracking sysctl values via a separate systemd unit [puppet] - https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) (owner: Muehlenhoff)
[10:56:53] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2788495 (Paladox) @ArielGlenn so should we revert?  We should try CMS?
[10:58:46] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2788499 (ArielGlenn) Just leave it for now.  If the logs show a sharp enough increase in pause times, I'll report...
[10:59:43] <ema>	 !log cp3043 depooled, testing https://phabricator.wikimedia.org/P4406 (T150503)
[10:59:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:48] <stashbot>	 T150503: Failure to save recent changes - https://phabricator.wikimedia.org/T150503
[11:07:10] <icinga-wm>	 PROBLEM - Varnish HTTP text-backend - port 3128 on cp3043 is CRITICAL: connect to address 10.20.0.178 and port 3128: Connection refused
[11:07:35] <ema>	 that's me, should be fixed soon ^
[11:08:10] <icinga-wm>	 RECOVERY - Varnish HTTP text-backend - port 3128 on cp3043 is OK: HTTP OK: HTTP/1.1 200 OK - 177 bytes in 0.168 second response time
[11:10:15] <ema>	 !log cp3043 repooled with gethdr_extrachance=100 (T150503)
[11:10:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:21] <stashbot>	 T150503: Failure to save recent changes - https://phabricator.wikimedia.org/T150503
[11:10:38] <grrrit-wm>	 (PS1) Alexandros Kosiaris: grafana: Provision the Server Board dashboard as JSON [puppet] - https://gerrit.wikimedia.org/r/320972
[11:23:20] <icinga-wm>	 PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:24:23] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2724169 (MoritzMuehlenhoff) Now we have gerrit running on Debian we also have the option to use openjdk-8 instead...
[11:31:10] <icinga-wm>	 RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[11:34:20] <icinga-wm>	 PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:35:11] <icinga-wm>	 RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[11:45:50] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[11:48:50] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[12:11:53] <grrrit-wm>	 (PS2) Muehlenhoff: Check whether ferm has been correctly started [puppet] - https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986)
[12:13:05] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Check whether ferm has been correctly started [puppet] - https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) (owner: Muehlenhoff)
[12:14:32] <grrrit-wm>	 (PS3) Muehlenhoff: Check whether ferm has been correctly started [puppet] - https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986)
[12:46:28] <paravoid>	 moritzm: where is sudo being called?
[12:46:48] <wikibugs>	 Operations, Patch-For-Review: Cleanup debconf handling in mailman puppet setup - https://phabricator.wikimedia.org/T144933#2788706 (MoritzMuehlenhoff) a:MoritzMuehlenhoff>None
[12:48:01] <wikibugs>	 Operations, Labs, Patch-For-Review: 4.4-series kernel vs. iptables - https://phabricator.wikimedia.org/T142388#2788708 (MoritzMuehlenhoff) Open>Resolved This has been fixed,  all labvirt systems are running Linux 4.4 for a while now.
[12:48:24] <grrrit-wm>	 (PS1) BBlack: VCL: fixups for synthetic error status [puppet] - https://gerrit.wikimedia.org/r/320975
[12:49:22] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2788712 (ArielGlenn) >>! In T148478#2788533, @MoritzMuehlenhoff wrote: > Now we have gerrit running on Debian we a...
[12:52:57] <moritzm>	 paravoid: oops, fixed
[12:53:13] <grrrit-wm>	 (PS4) Muehlenhoff: Check whether ferm has been correctly started [puppet] - https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986)
[12:53:57] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2788735 (Paladox) I could do this on the test instance I am using, but it may not work with gerrit 2.12 but may wi...
[12:54:30] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Check whether ferm has been correctly started [puppet] - https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) (owner: Muehlenhoff)
[13:00:01] <grrrit-wm>	 (PS5) Muehlenhoff: Check whether ferm has been correctly started [puppet] - https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986)
[13:01:09] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Check whether ferm has been correctly started [puppet] - https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) (owner: Muehlenhoff)
[13:05:02] <grrrit-wm>	 (PS6) Muehlenhoff: Check whether ferm has been correctly started [puppet] - https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986)
[13:05:22] <grrrit-wm>	 (PS2) BBlack: VCL: fixups for synthetic error status [puppet] - https://gerrit.wikimedia.org/r/320975
[13:06:25] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Check whether ferm has been correctly started [puppet] - https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) (owner: Muehlenhoff)
[13:08:06] <grrrit-wm>	 (PS7) Muehlenhoff: Check whether ferm has been correctly started [puppet] - https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986)
[13:11:05] <moritzm>	 !log installing curl security updates
[13:11:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:50] <icinga-wm>	 PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:12:14] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788776 (elukey) From the httpd point of view:   There are a lot of 503s logged for GET requests for /w/api.php like the following:  ``` 2016-11-11T12:07:44 59999926 10.64.0.1...
[13:15:25] <wikibugs>	 Operations: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532#2788778 (MoritzMuehlenhoff)
[13:31:46] <grrrit-wm>	 (CR) Faidon Liambotis: "LGTM - the dependencies (requires) are probably excessive/not very useful (the sudo user doesn't really require the file, and the nrpe de" [puppet] - https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) (owner: Muehlenhoff)
[13:32:02] <grrrit-wm>	 (CR) Faidon Liambotis: [C: 1] Check whether ferm has been correctly started [puppet] - https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) (owner: Muehlenhoff)
[13:34:13] <grrrit-wm>	 (CR) Faidon Liambotis: [C: -1] "See inline for a syntax error. I also still hate the _traditional part. Long-lived certificates are still the norm, and I think having a s" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) (owner: Alex Monk)
[13:35:11] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 641 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3050721 keys, up 11 days 5 hours - replication_delay is 641
[13:36:10] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3031423 keys, up 11 days 5 hours - replication_delay is 0
[13:40:38] <grrrit-wm>	 (PS1) DCausse: [WIP] test job jenkins with mw-core [mediawiki-config] - https://gerrit.wikimedia.org/r/320980 (https://phabricator.wikimedia.org/T143932)
[13:40:50] <icinga-wm>	 RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[13:41:17] <wikibugs>	 Operations: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532#2788834 (akosiaris) From a quick look into the Changelogs, 2.7 has nothing backwards incompatible that should worry us, 2.6 does however. Specifically  `The aio=native option to "-drive" now requires the cache=none...
[13:41:26] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] [WIP] test job jenkins with mw-core [mediawiki-config] - https://gerrit.wikimedia.org/r/320980 (https://phabricator.wikimedia.org/T143932) (owner: DCausse)
[13:42:28] <grrrit-wm>	 (PS3) Giuseppe Lavagetto: RESTBase config: Use special project for wikidata domains. [puppet] - https://gerrit.wikimedia.org/r/320529 (owner: Ppchelko)
[13:43:26] <grrrit-wm>	 (PS2) DCausse: [WIP] test job jenkins with mw-core [mediawiki-config] - https://gerrit.wikimedia.org/r/320980 (https://phabricator.wikimedia.org/T115713)
[13:44:10] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] [WIP] test job jenkins with mw-core [mediawiki-config] - https://gerrit.wikimedia.org/r/320980 (https://phabricator.wikimedia.org/T115713) (owner: DCausse)
[13:46:43] <grrrit-wm>	 (CR) Giuseppe Lavagetto: [C: 2] RESTBase config: Use special project for wikidata domains. [puppet] - https://gerrit.wikimedia.org/r/320529 (owner: Ppchelko)
[13:51:10] <grrrit-wm>	 (CR) DCausse: [C: -1] "test patch" [mediawiki-config] - https://gerrit.wikimedia.org/r/320980 (https://phabricator.wikimedia.org/T115713) (owner: DCausse)
[14:02:24] <moritzm>	 !log restarting hhvm on canary app servers to pick up libcurl update
[14:02:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:46] <mobrovac>	 !log restarting RESTBase to pick up https://gerrit.wikimedia.org/r/#/c/320529/
[14:03:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:00] <icinga-wm>	 PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199
[14:09:51] <icinga-wm>	 RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[14:17:10] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:18:10] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[14:29:53] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788887 (ema) We've been able to reproduce the bug on pinkunicorn by closing the connection before sending Content-Length bytes as follows:   ``` #!/usr/bin/env python  import...
[14:48:15] <grrrit-wm>	 (PS1) Alexandros Kosiaris: profile::docker::builder: Conditionalize hiera lookups [puppet] - https://gerrit.wikimedia.org/r/320985
[14:59:35] <grrrit-wm>	 (PS1) Muehlenhoff: Update to 4.4.31 [debs/linux44] - https://gerrit.wikimedia.org/r/320986
[15:06:45] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788905 (elukey) Even simpler:  ``` curl -d "Hola!" --header "Content-Length: 120" --header "Host: en.wikipedia.org" localhost/w/api.php ```  I checked the httpd trunk code an...
[15:22:15] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:23:15] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[15:25:52] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2788064 (Joe) So to be a bit more precise on what happens on apache:  `mod_proxy_fcgi` reads the request body in a loop, when it gets to the end of input according to the cont...
[15:26:52] <grrrit-wm>	 (PS2) Muehlenhoff: Update to 4.4.31 [debs/linux44] - https://gerrit.wikimedia.org/r/320986
[15:28:15] <wikibugs>	 Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2701526 (akosiaris) There is one issue I 'd like to (re?)touch on. Whether explicit hiera() lookups in profiles should have defaults or not (I am assu...
[15:32:13] <grrrit-wm>	 (PS1) Marostegui: mariadb: Split backup class into a different file [puppet] - https://gerrit.wikimedia.org/r/320989
[15:37:52] <grrrit-wm>	 (CR) Marostegui: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/4586/" [puppet] - https://gerrit.wikimedia.org/r/320989 (owner: Marostegui)
[15:40:50] <grrrit-wm>	 (CR) Muehlenhoff: [C: 2] Update to 4.4.31 [debs/linux44] - https://gerrit.wikimedia.org/r/320986 (owner: Muehlenhoff)
[15:44:25] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2788931 (Dzahn) Since the original now asks for a login, here's the Google cache version to why this was done:  ht...
[15:49:28] <wikibugs>	 Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2788933 (Andrew) A few things about http://garylarizza.com/blog/2014/02/17/puppet-workflow-part-2/:  1)  That argument is premised on a given user hav...
[15:57:00] <grrrit-wm>	 (Abandoned) Muehlenhoff: Configure connection tracking sysctl settings in ferm [puppet] - https://gerrit.wikimedia.org/r/320590 (https://phabricator.wikimedia.org/T136094) (owner: Muehlenhoff)
[15:57:56] <wikibugs>	 Operations, Commons, MediaWiki-File-management, Multimedia: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2788943 (MoritzMuehlenhoff)
[16:04:19] <grrrit-wm>	 (PS1) Gehel: Imported Upstream version 1.11.0 [debs/logstash-gelf] - https://gerrit.wikimedia.org/r/320991
[16:04:21] <grrrit-wm>	 (PS1) Gehel: New upstream version: 1.11.0 [debs/logstash-gelf] - https://gerrit.wikimedia.org/r/320992 (https://phabricator.wikimedia.org/T150408)
[16:04:28] <wikibugs>	 Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2788948 (akosiaris) >>! In T147718#2788933, @Andrew wrote: > A few things about http://garylarizza.com/blog/2014/02/17/puppet-workflow-part-2/: >  > 1...
[16:07:37] <grrrit-wm>	 (CR) Muehlenhoff: New upstream version: 1.11.0 (1 comment) [debs/logstash-gelf] - https://gerrit.wikimedia.org/r/320992 (https://phabricator.wikimedia.org/T150408) (owner: Gehel)
[16:10:46] <grrrit-wm>	 (PS2) Gehel: New upstream version: 1.11.0 [debs/logstash-gelf] - https://gerrit.wikimedia.org/r/320992 (https://phabricator.wikimedia.org/T150408)
[16:20:25] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[16:21:25] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3038251 keys, up 11 days 7 hours - replication_delay is 0
[16:21:29] <wikibugs>	 Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2788982 (Andrew) I really don't know how to engage when you assert that you are unable to understand how implicit lookups work.  They're unfamiliar an...
[16:21:51] <grrrit-wm>	 (PS1) Bmansurov: MF Beta: Enable moving first paragraph before infobox [mediawiki-config] - https://gerrit.wikimedia.org/r/320993 (https://phabricator.wikimedia.org/T149830)
[16:27:04] <wikibugs>	 Operations, DNS, Domains, Traffic, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2788993 (Aklapper) Outsider comment:  The task summary currently says "Point wikipedia.in to 205.147.101.160 instead of URL forward". If I curr...
[16:28:58] <grrrit-wm>	 (CR) Muehlenhoff: [C: 1] "I haven't reviewed the patches (and whether they are still needed with the new upstream release) but looks fine in general" [debs/logstash-gelf] - https://gerrit.wikimedia.org/r/320992 (https://phabricator.wikimedia.org/T150408) (owner: Gehel)
[16:34:00] <grrrit-wm>	 (PS1) Rush: tools nfsclient: cleanup [puppet] - https://gerrit.wikimedia.org/r/320995
[16:36:22] <grrrit-wm>	 (CR) Rush: [C: 2] tools nfsclient: cleanup [puppet] - https://gerrit.wikimedia.org/r/320995 (owner: Rush)
[16:37:05] <wikibugs>	 Operations, DNS, Domains, Traffic, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2789004 (Naveenpf) Hi Aklapper,  We are having multiple websites in same server. We are doing the same for all other Indic websites.   [root@e2...
[16:38:53] <wikibugs>	 Operations, ops-codfw, DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2789017 (Marostegui)
[16:44:05] <icinga-wm>	 PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:51:39] <grrrit-wm>	 (PS1) Rush: nfsclient: fix dependency issue with scratch [puppet] - https://gerrit.wikimedia.org/r/320999
[16:51:47] <grrrit-wm>	 (PS1) Ema: Revert "tlsproxy: turn proxy_request_buffering off for v4" [puppet] - https://gerrit.wikimedia.org/r/321000 (https://phabricator.wikimedia.org/T150503)
[16:53:56] <grrrit-wm>	 (CR) Rush: [C: 2 V: 2] nfsclient: fix dependency issue with scratch [puppet] - https://gerrit.wikimedia.org/r/320999 (owner: Rush)
[16:54:58] <grrrit-wm>	 (Abandoned) Rush: WIP: candidate idea for secondary backups [puppet] - https://gerrit.wikimedia.org/r/319365 (owner: Rush)
[16:55:12] <grrrit-wm>	 (PS2) Ema: Revert "tlsproxy: turn proxy_request_buffering off for v4" [puppet] - https://gerrit.wikimedia.org/r/321000 (https://phabricator.wikimedia.org/T150503)
[16:55:20] <grrrit-wm>	 (CR) Ema: [C: 2 V: 2] Revert "tlsproxy: turn proxy_request_buffering off for v4" [puppet] - https://gerrit.wikimedia.org/r/321000 (https://phabricator.wikimedia.org/T150503) (owner: Ema)
[17:00:35] <grrrit-wm>	 (PS1) Madhuvishy: labstore: Dual mount tools from labstore1001 and labstore-secondary [puppet] - https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154)
[17:07:15] <icinga-wm>	 PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:09:32] <wikibugs>	 Operations, Labs, Patch-For-Review, Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2789074 (madhuvishy)
[17:09:55] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[17:10:05] <wikibugs>	 Operations, Labs, Patch-For-Review, Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2652289 (madhuvishy)
[17:10:55] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[17:12:05] <icinga-wm>	 RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[17:13:41] <grrrit-wm>	 (CR) Rush: labstore: Dual mount tools from labstore1001 and labstore-secondary (1 comment) [puppet] - https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154) (owner: Madhuvishy)
[17:13:45] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 2] grafana: Provision the Server Board dashboard as JSON [puppet] - https://gerrit.wikimedia.org/r/320972 (owner: Alexandros Kosiaris)
[17:13:54] <grrrit-wm>	 (PS2) Filippo Giunchedi: grafana: Provision the Server Board dashboard as JSON [puppet] - https://gerrit.wikimedia.org/r/320972 (owner: Alexandros Kosiaris)
[17:15:33] <grrrit-wm>	 (CR) Madhuvishy: labstore: Dual mount tools from labstore1001 and labstore-secondary (1 comment) [puppet] - https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154) (owner: Madhuvishy)
[17:16:45] <wikibugs>	 Operations, Labs, Patch-For-Review, Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2789082 (madhuvishy)
[17:22:53] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, Traffic: Failure to save recent changes - https://phabricator.wikimedia.org/T150503#2789087 (ema) We've set nginx's proxy_request_buffering back on: https://gerrit.wikimedia.org/r/#/c/321000/ and that seems to help.
[17:23:55] <icinga-wm>	 PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/grafana/dashboards/server-board.json]
[17:31:19] <grrrit-wm>	 (CR) Rush: labstore: Dual mount tools from labstore1001 and labstore-secondary (1 comment) [puppet] - https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154) (owner: Madhuvishy)
[17:31:55] <grrrit-wm>	 (PS3) Rush: labs: add ores_classification and ores_model tables [puppet] - https://gerrit.wikimedia.org/r/320804 (https://phabricator.wikimedia.org/T148561) (owner: Ladsgroup)
[17:33:55] <icinga-wm>	 PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:36:15] <icinga-wm>	 RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[17:37:37] <wikibugs>	 Operations, Labs, Patch-For-Review, Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2789124 (chasemp)
[17:39:02] <grrrit-wm>	 (PS2) Madhuvishy: labstore: Dual mount tools from labstore1001 and labstore-secondary [puppet] - https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154)
[17:55:15] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 57 failures. Last run 2 minutes ago with 57 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[17:56:32] <grrrit-wm>	 (PS1) Filippo Giunchedi: fixup for I13b135e4 [puppet] - https://gerrit.wikimedia.org/r/321012
[17:57:02] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 2 V: 2] fixup for I13b135e4 [puppet] - https://gerrit.wikimedia.org/r/321012 (owner: Filippo Giunchedi)
[18:01:55] <icinga-wm>	 RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[18:04:55] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[18:05:05] <icinga-wm>	 RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[18:06:33] <wikibugs>	 Operations: Evaluate use of systemd-timesyncd on jessie for clock synchronisation - https://phabricator.wikimedia.org/T150257#2789160 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[18:08:55] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[18:12:14] <grrrit-wm>	 (PS14) Mobrovac: PDF Render Service: Role and module [puppet] - https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129)
[18:16:48] <grrrit-wm>	 (PS15) Mobrovac: PDF Render Service: Role and module [puppet] - https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129)
[18:17:25] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:18:25] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[18:20:25] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[18:21:25] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3043153 keys, up 11 days 9 hours - replication_delay is 0
[18:23:01] <grrrit-wm>	 (PS1) Yuvipanda: Add libenchant to python(2)? base images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/321013 (https://phabricator.wikimedia.org/T143449)
[18:25:37] <grrrit-wm>	 (PS5) Alex Monk: Split check_ssl between traditional year-long certs and LE's 3 month certs [puppet] - https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293)
[18:27:04] <grrrit-wm>	 (PS6) Alex Monk: Split check_ssl between traditional year-long certs and LE's 3 month certs [puppet] - https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293)
[18:29:38] <grrrit-wm>	 (CR) Madhuvishy: [C: 2] Add libenchant to python(2)? base images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/321013 (https://phabricator.wikimedia.org/T143449) (owner: Yuvipanda)
[18:30:12] <grrrit-wm>	 (Merged) jenkins-bot: Add libenchant to python(2)? base images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/321013 (https://phabricator.wikimedia.org/T143449) (owner: Yuvipanda)
[18:34:32] <grrrit-wm>	 (PS16) Mobrovac: PDF Render Service: Role and module [puppet] - https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129)
[18:52:28] <grrrit-wm>	 (PS17) Mobrovac: PDF Render Service: Role and module [puppet] - https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129)
[19:02:15] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[19:04:55] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[19:08:35] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 04-1] Split check_ssl between traditional year-long certs and LE's 3 month certs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) (owner: 10Alex Monk)
[19:09:55] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[19:10:14] <paladox>	 ^^ that's not me restarting grrrit-wm
[19:10:23] <paladox>	 i haven't restarted it today
[19:19:15] <icinga-wm>	 PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:20:33] <grrrit-wm>	 (03CR) 10Rush: [C: 031] labstore: Dual mount tools from labstore1001 and labstore-secondary [puppet] - 10https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154) (owner: 10Madhuvishy)
[19:42:26] <Revent>	 Hey guys?
[19:43:02] <Revent>	 Just wondering… is it one of you peeps that’s poking broken transcodes back through the queue on Commons?
[19:44:25] <Revent>	 Those hour+ HD files won’t successfully get through unless they are run one-per-server at a time… they time out after 6 hours or so, if run several at a time.
[19:45:31] <Revent>	 Someone put 4x transcodes of a 2.57GB file on there, at once… it will not work.
[19:48:15] <icinga-wm>	 RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[19:51:06] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:51:55] <icinga-wm>	 PROBLEM - puppet last run on mc1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:54:13] <grrrit-wm>	 (03PS3) 10Madhuvishy: labstore: Dual mount tools from labstore1001 and labstore-secondary [puppet] - 10https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154) 
[19:54:20] <grrrit-wm>	 (03CR) 10Madhuvishy: [C: 032 V: 032] labstore: Dual mount tools from labstore1001 and labstore-secondary [puppet] - 10https://gerrit.wikimedia.org/r/321001 (https://phabricator.wikimedia.org/T146154) (owner: 10Madhuvishy)
[19:55:05] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[20:04:05] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[20:07:14] <grrrit-wm>	 (03PS1) 10Madhuvishy: labstore: Fix service urls for secondary nfs cluster [puppet] - 10https://gerrit.wikimedia.org/r/321017 
[20:08:05] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[20:08:55] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[20:08:56] <grrrit-wm>	 (03CR) 10Madhuvishy: [C: 032] labstore: Fix service urls for secondary nfs cluster [puppet] - 10https://gerrit.wikimedia.org/r/321017 (owner: 10Madhuvishy)
[20:10:55] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[20:15:45] <raynor>	 Hey everyone, I have a question about beta cluster configuration - anyone here to help ?
[20:16:49] <raynor>	 Question - PHP reads host name from config - key `Server` 
[20:17:20] <raynor>	 I just want to check what's under that key for `wikipedia.beta.wmflabs.org`
[20:17:23] <Zppix>	 raynor #wikimedia-releng
[20:17:44] <raynor>	 thx Zppix
[20:17:54] <Zppix>	 no problem
[20:19:55] <icinga-wm>	 RECOVERY - puppet last run on mc1008 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[20:23:05] <icinga-wm>	 PROBLEM - puppet last run on ms-be1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:23:19] <Zppix>	 Reedy, or greg-g around?
[20:24:35] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=300.60 Read Requests/Sec=3115.60 Write Requests/Sec=5.50 KBytes Read/Sec=20630.80 KBytes_Written/Sec=2303.60
[20:26:12] <Zppix>	 anyone?
[20:27:00] <paladox>	 grrrit-wm: restart
[20:27:08] <grrrit-wm>	 re-connecting to gerrit
[20:27:09] <grrrit-wm>	 reconnected to gerrit
[20:27:17] <paladox>	 grrrit-wm: force-restart
[20:27:19] <grrrit-wm>	 re-connecting to gerrit and irc.
[20:27:55] <paladox>	 grrrit-wm: nick
[20:28:00] <grrrit-wm>	 re-connected to gerrit and irc.
[20:28:14] <paladox>	 grrrit-wm: nick
[20:28:19] <grrrit-wm>	 Nick is already grrrit-wm not changing the nick.
[20:28:20] <grrrit-wm>	 Nick is already grrrit-wm not changing the nick.
[20:28:25] <paladox>	 grrrit-wm: help
[20:28:27] <grrrit-wm>	 My current commands are: grrrit-wm: restart, grrrit-wm: force-restart,  and grrrit-wm: nick
[20:28:37] <wikibugs>	 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2789330 (10matmarex) It appears that the vast majority of...
[20:33:27] <tiddlywink>	 I am so glad I don't stalk my username on IRC. 
[20:34:17] <Zppix>	 lol
[20:36:35] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=185.60 Read Requests/Sec=164.70 Write Requests/Sec=2.40 KBytes Read/Sec=3716.40 KBytes_Written/Sec=370.00
[20:37:25] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:37:30] <Zppix>	 bastion-3's IP is blocked on enwiki atm; a bot got logged out or something and was editing as bastion-3
[20:37:33] <Zppix>	 just fyi
[20:40:15] <paladox>	 grrrit-wm: restart
[20:40:22] <grrrit-wm>	 re-connecting to gerrit
[20:40:25] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[20:50:08] <grrrit-wm>	 (03PS6) 10Filippo Giunchedi: Initial commit [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) 
[20:51:05] <icinga-wm>	 RECOVERY - puppet last run on ms-be1009 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[20:52:13] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: "Minimal test scaffolding added" [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi)
[20:53:50] <wikibugs>	 06Operations, 05Prometheus-metrics-monitoring: Deploy federation for Prometheus - https://phabricator.wikimedia.org/T150486#2789344 (10fgiunchedi)
[20:54:31] <Zppix>	 jouncebot now
[20:54:32] <jouncebot>	 No deployments scheduled for the next 65 hour(s) and 5 minute(s)
[20:54:38] <grrrit-wm>	 (03PS1) 10Madhuvishy: exec-manage: Change order of params to support xargs for node names [puppet] - 10https://gerrit.wikimedia.org/r/321022 
[20:56:13] <grrrit-wm>	 (03CR) 10Madhuvishy: [C: 032] exec-manage: Change order of params to support xargs for node names [puppet] - 10https://gerrit.wikimedia.org/r/321022 (owner: 10Madhuvishy)
[20:58:06] <grrrit-wm>	 (03PS1) 10Hashar: jenkins: disable cli [puppet] - 10https://gerrit.wikimedia.org/r/321023 
[20:59:54] <grrrit-wm>	 (03PS2) 10ArielGlenn: jenkins: disable cli [puppet] - 10https://gerrit.wikimedia.org/r/321023 (owner: 10Hashar)
[21:02:29] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] jenkins: disable cli [puppet] - 10https://gerrit.wikimedia.org/r/321023 (owner: 10Hashar)
[21:06:14] <hashar>	 !log Restarted Jenkins
[21:06:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:55] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[21:08:15] <icinga-wm>	 PROBLEM - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused
[21:08:32] <hashar>	 checking
[21:08:37] <apergos>	 thank you
[21:08:54] <apergos>	 didn't have that issue earlier
[21:08:55] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[21:11:15] <icinga-wm>	 RECOVERY - jenkins_zmq_publisher on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 8888
[21:11:57] <hashar>	 !log jenkins: disabled/reenabled the ZMQ Event Publisher. Apparently it refused to start
[21:12:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:31] <apergos>	 silly thing
[21:12:51] <hashar>	 I am restarting it again to confirm
[21:15:40] <hashar>	 apergos: that was a one time error
[21:15:48] <apergos>	 great
[21:15:58] <hashar>	 !log Restarted Jenkins. This time ZMQ managed to bind to port 8888
[21:16:01] <hashar>	 sounds good to me
[21:16:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:10] <icinga-wm>	 PROBLEM - MariaDB disk space on labsdb1004 is CRITICAL: DISK CRITICAL - free space: /srv/labsdb 122775 MB (5% inode=99%)
[21:26:41] <godog>	 wah wah, I'll take a look
[21:27:09] <volans>	 godog: ping me if you need a hand ;)
[21:27:17] <paladox>	 grrrit-wm: restart
[21:27:39] <chasemp>	 godog: volans thanks guys :) I think that's the toolsdb slave? I can't recall
[21:27:46] <paladox>	 grrrit-wm: restart
[21:28:50] <godog>	 volans: thanks!
[21:29:06] <godog>	 yeah chasemp I think it is up for reimporting for the jessie migration, https://phabricator.wikimedia.org/T123731
[21:29:34] <godog>	 or not, the sal mentioned only a reimport
[21:31:13] <godog>	 looks to me like the free space has been steadily declining, though there's still space free on the VG
[21:32:21] <marostegui>	 godog: I would say increase the volume
[21:34:07] <godog>	 marostegui: I agree, looks like it started yesterday to go down significantly though
[21:34:24] <volans>	 marostegui: look if there is any offender that is loading a ton of data
[21:34:27] <volans>	 it happened in the past
[21:34:58] <volans>	 that some maintenance or other changes on some tools used a lot of data
[21:35:55] <godog>	 also what was the magic incantation for the mysql client? I'm getting SSL certificate validation failure
[21:36:01] <marostegui>	 godog: you gave it 100G right?
[21:36:10] <icinga-wm>	 RECOVERY - MariaDB disk space on labsdb1004 is OK: DISK OK
[21:36:18] <wikibugs>	 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2789393 (10Joe) >>! In T147718#2788982, @Andrew wrote: > Am I really the only one out here in favor of simply using the language /as it is designed/?...
[21:36:19] <marostegui>	 godog: --skip-ssl but for labsdb it is disabled, you need to go through neodymium in this case
[21:36:22] <godog>	 marostegui: no I didn't touch it
[21:36:27] <Zppix>	 maybe a quick cache clear on the server that was using a lot of disk space could help?
[21:36:32] <marostegui>	 volans: nothing is using it now
[21:36:49] <godog>	 akosiaris perhaps expanded it
[21:36:49] <marostegui>	 godog: Interesting the size of the vg went from 2T to 2.1T
[21:36:53] <marostegui>	 ah
[21:36:55] <marostegui>	 ok
[21:37:07] <volans>	 marostegui: I've put a watch with du and saw the data directory growing (ofc I would say), I'm looking inside
[21:37:57] <godog>	 marostegui: yeah I noticed also no .my.cnf, thanks anyways!
[21:39:54] <marostegui>	 there are looots of binlogs for today and yesterday
[21:39:56] <marostegui>	 more than usual
[21:40:07] <volans>	 yes every few minutes
[21:40:26] <volans>	 100MB, 3.4MB, 339 bytes
[21:41:25] <marostegui>	 https://phabricator.wikimedia.org/P4411
[21:41:51] <volans>	 anything on processlist?
[21:41:57] <marostegui>	 nope
[21:42:10] <godog>	 looks like it isn't polled by prometheus ;_;
[21:43:28] <marostegui>	 the last binlog has stopped growing so much
[21:43:33] <marostegui>	 so whatever it is stopped
[21:43:56] <volans>	 marostegui: did you change its role in the last few days? look at bytes in/out on tendril
[21:44:01] <volans>	 since yesterday changed a lot
[21:44:06] <marostegui>	 no
[21:46:28] <godog>	 looks like the same increase in bytes is also on its master labsdb1005
[21:47:13] <marostegui>	 so I guess someone importing stuff? as I said, the binlog has now stopped
[21:47:47] <volans>	 marostegui: from my diffs no single DB has grown in this short amount of time, looks like the space was all from relay logs, that makes me think of some update/replace activity
[21:48:27] <marostegui>	 oh, interesting if a db didn't increase, it was probably the relay logs then yes
[21:50:09] <volans>	 ufff... mysqlbinlog: unknown variable 'default-character-set=utf8mb4'
[21:50:37] <marostegui>	 volans: alias mysqlbinlog='/opt/wmf-mariadb10/bin/mysqlbinlog --defaults-file=/root/.my.cnf'
[21:50:52] <volans>	 there is no /root/.my.cnf there ;)
[21:50:59] <marostegui>	 ah true :(
[21:51:27] <marostegui>	 Since this is no longer an issue I am going to log off (I am in a restaurant XD). I will take a look tomorrow if needed!
[21:51:28] <volans>	 but /dev/null works ;)
[21:51:31] <marostegui>	 thanks guys for the help
[21:51:48] <godog>	 np, bye marostegui !
[22:05:55] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[22:07:55] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[22:17:05] <icinga-wm>	 PROBLEM - MD RAID on ms-be2025 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:17:15] <icinga-wm>	 PROBLEM - Docker registry HTTP interface on darmstadtium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:17:55] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2025 is CRITICAL: CRITICAL - load average: 249.72, 133.53, 66.40
[22:19:38] <godog>	 poop
[22:21:42] <volans>	 godog: so the RAID check is actually a load check :-P
[22:22:05] <icinga-wm>	 RECOVERY - Docker registry HTTP interface on darmstadtium is OK: HTTP OK: HTTP/1.1 200 OK - 2460 bytes in 0.731 second response time
[22:22:15] <icinga-wm>	 PROBLEM - Disk space on ms-be2025 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sda3 is not accessible: Input/output error
[22:23:52] <godog>	 heheh sort of
[22:23:56] <icinga-wm>	 PROBLEM - swift-container-replicator on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[22:23:56] <icinga-wm>	 PROBLEM - swift-object-replicator on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[22:23:56] <icinga-wm>	 PROBLEM - swift-account-replicator on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[22:23:56] <icinga-wm>	 PROBLEM - swift-container-auditor on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[22:23:56] <icinga-wm>	 PROBLEM - swift-container-updater on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[22:24:09] <godog>	 shush
[22:24:35] <icinga-wm>	 PROBLEM - swift-object-auditor on ms-be2025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[22:28:55] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2025 is OK: OK - load average: 1.46, 0.31, 0.10
[22:28:55] <icinga-wm>	 RECOVERY - swift-container-replicator on ms-be2025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[22:28:55] <icinga-wm>	 RECOVERY - swift-container-auditor on ms-be2025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[22:28:55] <icinga-wm>	 RECOVERY - MD RAID on ms-be2025 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[22:29:05] <icinga-wm>	 RECOVERY - swift-container-updater on ms-be2025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[22:29:05] <icinga-wm>	 RECOVERY - swift-account-replicator on ms-be2025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[22:29:05] <icinga-wm>	 RECOVERY - swift-object-replicator on ms-be2025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[22:29:15] <icinga-wm>	 RECOVERY - Disk space on ms-be2025 is OK: DISK OK
[22:29:35] <icinga-wm>	 RECOVERY - swift-object-auditor on ms-be2025 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[22:37:58] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: admin: revoke jforrester ssh key [puppet] - 10https://gerrit.wikimedia.org/r/321081 
[22:39:35] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032] admin: revoke jforrester ssh key [puppet] - 10https://gerrit.wikimedia.org/r/321081 (owner: 10Filippo Giunchedi)
[22:39:45] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: admin: revoke jforrester ssh key [puppet] - 10https://gerrit.wikimedia.org/r/321081 
[22:40:33] <Zppix>	 what's going on with jforrester?
[22:40:59] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [V: 032] admin: revoke jforrester ssh key [puppet] - 10https://gerrit.wikimedia.org/r/321081 (owner: 10Filippo Giunchedi)
[22:42:33] <godog>	 Zppix: not sure yet
[22:43:16] <Zppix>	 godog should we riot xD (resuming normal development and stuff i was doing)
[22:43:36] <godog>	 haha no no worries Zppix 
[22:59:31] <wikibugs>	 06Operations, 06Labs, 10wikitech.wikimedia.org: Can't login wikitech - https://phabricator.wikimedia.org/T144805#2610832 (10Shoichi) Months ago, I can log in,but it also happen to me. I don't know rember I had set two-factor authentication or not. Can someone help me? >_< My account there is also "shoichi"
[23:00:50] <paladox>	 grrrit-wm: nick
[23:00:51] <grrrit-wm>	 Nick is already grrrit-wm not changing the nick.
[23:00:59] <paladox>	 grrrit-wm: restart
[23:01:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:01:35] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:01:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:02:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:02:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:02:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:02:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:02:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:02:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:02:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:02:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:02:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:02:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:02:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:02:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:02:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:02:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: NoneType object has no attribute get
[23:03:41] <godog>	 -.-
[23:04:03] <godog>	 I'll take a look
[23:04:15] <apergos>	 thanks, I'm about ready to check out (1am)
[23:04:56] <godog>	 hehe enjoy the weekend apergos 
[23:05:05] <apergos>	 thanks, have one yourself soon
[23:05:55] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[23:10:55] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[23:11:55] <godog>	 bah there was a restbase deployment earlier today but I'm not sure why the check failed only now
[23:13:29] <Pchelolo>	 godog: there was no deployment today, only yesterday
[23:13:53] <godog>	 Pchelolo: err yeah a config deployment rather
[23:14:05] <godog>	 https://gerrit.wikimedia.org/r/#/c/320529 namely
[23:14:18] <Pchelolo>	 godog: that should be unrelated, it's for wikidata, we're checking on en.wikipedia
[23:15:43] <godog>	 Pchelolo: true, any idea on what it could be?
[23:15:59] <Pchelolo>	 godog: I'm looking right now, all seems fine..
[23:16:33] <Pchelolo>	 I've made the same request the checker does via curl - looks ok..
[23:16:58] <Pchelolo>	 but the checker script is failing indeed
[23:18:32] <Zppix>	 I don't know who to talk to, but I just wanted to make you guys aware that 10.68.17.205 is blocked on enwiki due to a logged-out/anon (confirmed) bot; however, that IP is confirmed to be from Tool Labs. Is it possible for you guys to investigate?
[23:20:54] <godog>	 Pchelolo: what puzzles me the most is why now
[23:21:56] <Zppix>	 Pchelolo maybe it lies in a DB that restbase pulls from? just a thought, I have 0 clue about restbase except that it's widely used in MW-related stuff
[23:22:20] <Pchelolo>	 godog: no idea literally. And I'm not quite sure what's happening, RB responds to curl request, nothing in the logs, we didn't deploy anything related since yesterday morning SD
[23:25:27] <Pchelolo>	 godog: got it! the thumbnail of the page is not returned any more for some reason, but we're expecting it
[23:26:00] <Pchelolo>	 godog: here we go, some vandalism: https://en.wikipedia.org/w/index.php?title=Barack_Obama&oldid=749031760
[23:26:11] <Pchelolo>	 it should fix itself in a bit
[23:27:05] <godog>	 ah ah! 
[23:27:36] <Pchelolo>	 once the linksupdatejob catches up and the pageimage property comes back
[23:27:41] <Pchelolo>	 false alarm :)
[23:28:20] <Zppix>	 Pchelolo if it happens again, PM me; I am a rollbacker on English Wikipedia
[23:29:09] <Pchelolo>	 kk Zppix thank you
[23:29:15] <Zppix>	 wait a min, why was vandalism causing a restbase issue?
[23:30:14] <godog>	 I'll file a task for service checker to be a bit more verbose, not sure how you debugged it though Pchelolo 
[23:31:02] <Pchelolo>	 godog: ye, that message was quite mysterious
[23:34:28] <grrrit-wm>	 (03CR) 10Alex Monk: Split check_ssl between traditional year-long certs and LE's 3 month certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) (owner: 10Alex Monk)
[23:36:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[23:36:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[23:36:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[23:36:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[23:36:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[23:36:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[23:36:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy
[23:36:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[23:36:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[23:36:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy
[23:36:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[23:36:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
[23:36:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy
[23:36:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy
[23:36:31] <Pchelolo>	 here we go ^^^
[23:36:39] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[23:36:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[23:38:22] <wikibugs>	 06Operations, 06Operations-Software-Development, 06Services: More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#2789590 (10fgiunchedi)
[23:38:28] <godog>	 nice, filed as ^
[23:39:10] <wikibugs>	 06Operations, 06Operations-Software-Development, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#2789602 (10Pchelolo)
[23:47:04] <wikibugs>	 06Operations, 10Traffic: Extra RTT on TLS handshakes - https://phabricator.wikimedia.org/T150561#2789604 (10BBlack)
[23:47:47] <grrrit-wm>	 (03PS7) 10Alex Monk: Split check_ssl between traditional year-long certs and LE's 3 month certs [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) 
[23:52:07] <wikibugs>	 06Operations, 10Traffic: Extra RTT on TLS handshakes - https://phabricator.wikimedia.org/T150561#2789618 (10BBlack)