[00:36:37] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:01:10] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:17:49] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:42:36] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:23:05] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:24:27] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:29:56] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 670 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4013713 keys - replication_delay is 670 [02:34:47] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [02:39:37] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [02:47:18] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3989934 keys - replication_delay is 0 [02:47:50] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [02:49:09] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [03:39:07] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2666041 (10Tbayer) Related IRC discussion: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-tech/20160925.txt (seems to have been resolved already: T146569 ). 
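The Redis alert above fires when replication_delay crosses the 600-second threshold (670 was measured on rdb2006). A minimal Python sketch of that kind of probe follows; the host:port, key count and threshold are taken from the alert text, but mapping "replication_delay" to Redis's master_last_io_seconds_ago field is an assumption — the production check is a separate implementation.

```python
#!/usr/bin/env python3
"""Sketch of a Redis replication-delay probe in the spirit of the
'Redis status tcp_6479' alert above; not the production check."""
import sys
import redis  # pip install redis

WARN, CRIT = 300, 600              # seconds; 600 matches the critical value above
HOST, PORT = "10.192.48.44", 6479  # taken from the alert text

def main():
    r = redis.StrictRedis(host=HOST, port=PORT, socket_timeout=5)
    info = r.info()
    # On a replica, master_last_io_seconds_ago says how long since the last
    # interaction with the master; treat that as the replication delay.
    delay = info.get("master_last_io_seconds_ago", 0)
    keys = sum(db.get("keys", 0) for name, db in info.items()
               if isinstance(db, dict) and name.startswith("db"))
    msg = f"REDIS on {HOST}:{PORT} has {keys} keys - replication_delay is {delay}"
    if delay >= CRIT:
        print(f"CRITICAL: {msg}")
        return 2
    if delay >= WARN:
        print(f"WARNING: {msg}")
        return 1
    print(f"OK: {msg}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```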
[03:51:29] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2666070 (10Peachey88) [06:28:18] (03Draft2) 10MarcoAurelio: DNS configuration for olo.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/312805 (https://phabricator.wikimedia.org/T146612) [06:28:26] (03Draft1) 10MarcoAurelio: DNS configuration for olo.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/312805 (https://phabricator.wikimedia.org/T146612) [06:28:43] (03CR) 10MarcoAurelio: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/312805 (https://phabricator.wikimedia.org/T146612) (owner: 10MarcoAurelio) [06:53:44] (03Draft2) 10MarcoAurelio: Initial configuration for olo.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) [06:53:50] (03Draft1) 10MarcoAurelio: Initial configuration for olo.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) [06:58:27] (03Draft2) 10MarcoAurelio: RESTBase configuration for olo.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/312808 [06:58:31] (03Draft1) 10MarcoAurelio: RESTBase configuration for olo.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/312808 [07:26:43] (03Draft2) 10MarcoAurelio: Labs configuration for olo.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/312812 (https://phabricator.wikimedia.org/T146612) [07:26:48] (03Draft1) 10MarcoAurelio: Labs configuration for olo.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/312812 (https://phabricator.wikimedia.org/T146612) [07:27:36] (03PS3) 10MarcoAurelio: RESTBase configuration for olo.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/312808 (https://phabricator.wikimedia.org/T146612) [07:35:41] 06Operations, 10ops-eqiad, 10DBA: db1060: Degraded RAID - https://phabricator.wikimedia.org/T146449#2666237 (10Marostegui) All good now! ``` Device Present ================ Virtual Drives : 1 Degraded : 0 Offline : 0 Physical Devices : 14 Disks... [07:36:57] 06Operations, 10ops-eqiad, 10DBA: db1060: Degraded RAID - https://phabricator.wikimedia.org/T146449#2666238 (10Marostegui) 05Open>03Resolved [07:51:12] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search (Current work): Resolve huge perf regression on autocomplete queries - https://phabricator.wikimedia.org/T146465#2666257 (10dcausse) 05Open>03Resolved a:03EBernhardson Thanks @ema and @EBernhardson ! [07:52:17] (03PS1) 10Urbanecm: [throttle] Rule for Winona State University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312815 (https://phabricator.wikimedia.org/T146600) [08:14:37] (03PS2) 10Urbanecm: [throttle] Rule for Winona State University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312815 (https://phabricator.wikimedia.org/T146600) [08:16:05] (03PS1) 10Urbanecm: [throttle] Rule for Winona State University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312816 (https://phabricator.wikimedia.org/T146600) [08:22:21] morning all. http://wikimediacommons.org/ seems to have stopped working (it's supposed to redirect to commons.wikimedia.org) [08:22:31] do you want a Phab ticket opened ? 
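The db1060 RAID task above was closed on the strength of a MegaCLI-style summary ("Degraded : 0", "Offline : 0"). A rough sketch of parsing that kind of output is below; the megacli path and exact field layout are assumptions, and the production RAID check is a different script.

```python
#!/usr/bin/env python3
"""Sketch: parse a MegaCLI adapter summary like the one pasted in T146449
above and flag degraded/offline virtual drives. Command path is hypothetical."""
import re
import subprocess
import sys

CMD = ["/usr/sbin/megacli", "-AdpAllInfo", "-aALL", "-NoLog"]  # hypothetical path

def counts(text):
    """Pull the 'Degraded' and 'Offline' counters out of the summary block."""
    out = {}
    for field in ("Degraded", "Offline"):
        m = re.search(rf"^\s*{field}\s*:\s*(\d+)", text, re.MULTILINE)
        out[field] = int(m.group(1)) if m else None
    return out

def main():
    text = subprocess.run(CMD, capture_output=True, text=True, check=False).stdout
    c = counts(text)
    if any(v is None for v in c.values()):
        print("UNKNOWN: could not parse controller output")
        return 3
    if c["Degraded"] or c["Offline"]:
        print(f"CRITICAL: {c['Degraded']} degraded, {c['Offline']} offline")
        return 2
    print("OK: no degraded or offline virtual drives")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```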
[08:28:37] NotASpy: yes pelase [08:29:45] looks like this was possibly planned https://phabricator.wikimedia.org/T105981#2233631 [08:32:02] maybe but the entries are still in puppet so that at least could be cleared up on the ticket [08:32:08] modules/mediawiki/files/apache/sites/redirects.conf this file still has them [08:32:20] also https://phabricator.wikimedia.org/T101048 [08:32:42] but neither of them has recent updates over the last few days/weeks [08:33:03] might as well reference them in the new ticket too [08:33:22] then hopefully we can get a clear decision / status [08:35:58] robh: whenyou get here can you change the topic to make me the clinic duty person? [08:36:12] done =] [08:36:16] thank you! [08:42:10] https://phabricator.wikimedia.org/T146619 for your viewing pleasure. [08:46:57] thank you, NotASpy [08:47:00] redirection policy task, eeeee [08:47:05] * robh appends project and otherwise avoids [08:47:32] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666372 (10ArielGlenn) Note that the redirects are still in puppet: modules/mediawiki/files/apache/sites/redirects.conf [08:47:38] There are strong viewpoints about using all those alternative domain names/urls. [08:48:13] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666373 (10ArielGlenn) p:05Triage>03Normal [08:48:53] robh, should I add you as a subscriber on this ticket? [08:48:58] hellllll no [08:49:01] hahahaha [08:49:01] ;] [08:49:22] NotASpy: note that there will probably be no movement on it this week, the opsen are at an off site so only dealing with emergencies [08:49:35] I did ask the question if we should just nuke all instances of redirecting domains, quite happy to do that, I have no strong feeling either way. [08:50:21] also the policy is a org wide one, so ops cannot make it in a silo [08:50:34] we do need to decide on a wholesale policy on how to treat redirection by default. [08:51:48] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666375 (10ArielGlenn) Adding @Krenair and @BBlack from the other tickets. [08:52:16] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666377 (10Krenair) Looks like it just has no A records in DNS [08:53:13] Krenair, yeah they were removed apparently as part of cleanup [08:53:40] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666378 (10Krenair) Ah, it was https://gerrit.wikimedia.org/r/#/c/244092/ [08:53:43] but should they have been? then should the redirects go too? or is it premature, I have no idea [08:53:44] so [08:53:57] I punt to you folks :-D [09:09:27] Now all the redirecting domains can just use Let's encrypt, can't they [09:11:04] it's not quite that simple due to the scale of the problem [09:11:20] LE limits certs to 100 domains each [09:13:21] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:14:11] that limit could do us in pretty quick [09:15:01] pretty sure we're already over that limit [09:17:10] (in terms of number of domains we'd want on there) [09:17:35] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:21:49] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 38458 MB (3% inode=99%) [09:31:37] RECOVERY - Disk space on maps-test2001 is OK: DISK OK [09:40:30] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:41:02] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2666428 (10Krinkle) [09:41:13] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#1494832 (10Krinkle) [09:42:24] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [09:45:20] (03PS1) 10Yuvipanda: k8s: Make the kubernetes user be a member of ssl-certs group [puppet] - 10https://gerrit.wikimedia.org/r/312817 [09:46:25] (03CR) 10jenkins-bot: [V: 04-1] k8s: Make the kubernetes user be a member of ssl-certs group [puppet] - 10https://gerrit.wikimedia.org/r/312817 (owner: 10Yuvipanda) [09:46:25] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100% [09:46:32] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:47:35] (03PS2) 10Yuvipanda: k8s: Make the kubernetes user be a member of ssl-certs group [puppet] - 10https://gerrit.wikimedia.org/r/312817 [09:48:10] Jeff_Green: ^ you are workign on betelgeuse right? [09:48:28] codfw frack host. [09:48:38] robh yeah, i rebooted it for a kernel update and apparently it didn't come back within the 10 minute window of the downtime [09:48:40] looking [09:48:59] cool, just wanted to ensure you were aware [09:49:02] thx [09:49:24] aaaand my android phone has no signal because it probably needs a restart [09:51:03] fixed. now I can get pages :-P [09:54:34] \o/ [09:54:44] 'lo apergos [09:58:42] 07Puppet, 10Beta-Cluster-Infrastructure, 07Beta-Cluster-reproducible, 07Easy: "Connect to 'deployment.eqiad.wmnet' instead" when you ssh into deployment-tin on Beta - https://phabricator.wikimedia.org/T146505#2666450 (10hashar) p:05Triage>03Low [10:00:24] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 37.90 ms [10:04:22] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666339 (10Bawolff) Stupid question - if its just about cost of certs, can't we use LetsEncrypt? 
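Picking up the Let's Encrypt thread above ("LE limits certs to 100 domains each", "pretty sure we're already over that limit"): a quick sketch for estimating how many SAN certificates the redirect domains would need under that limit. The input file name is hypothetical, and the 100-name figure is the one quoted in the conversation; the exact rate limits are revisited later in T146619.

```python
#!/usr/bin/env python3
"""Sketch: estimate how many Let's Encrypt certificates a pile of redirect
domains would need, given the 100-names-per-certificate limit quoted above.
'redirect-domains.txt' is a hypothetical input, one domain per line."""
import sys

NAMES_PER_CERT = 100  # limit cited in the discussion above

def chunked(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def main(path="redirect-domains.txt"):
    with open(path) as f:
        domains = sorted({line.strip() for line in f if line.strip()})
    batches = list(chunked(domains, NAMES_PER_CERT))
    print(f"{len(domains)} domains -> {len(batches)} certificate(s)")
    for n, batch in enumerate(batches, 1):
        print(f"  cert {n}: {batch[0]} .. {batch[-1]} ({len(batch)} names)")

if __name__ == "__main__":
    main(*sys.argv[1:])
```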
[10:05:37] 06Operations, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery: Puppet sslcert::ca does not refresh the certificate symlinks when a .crt is updated - https://phabricator.wikimedia.org/T145609#2666467 (10hashar) [10:06:08] 06Operations, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Discovery: Puppet sslcert::ca does not refresh the certificate symlinks when a .crt is updated - https://phabricator.wikimedia.org/T145609#2635904 (10hashar) p:05Triage>03Normal [10:06:41] morning Bsad owski [10:06:44] 1 [10:06:45] heh [10:07:01] Bsadowski1: 100 cert limit from LE [10:07:34] I"m going torun around the corner and get breakfast foods (and lunch at the same time), omg it's already 1 pm [10:07:36] brb [10:09:31] (03CR) 10Marostegui: [C: 031] mariadb: fix class dependency on beta [puppet] - 10https://gerrit.wikimedia.org/r/312652 (owner: 10Hashar) [10:11:27] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [10:13:47] 06Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Redirect yue.wikipedia.org to zh-yue.wikipedia.org - https://phabricator.wikimedia.org/T105999#2666481 (10Liuxinyu970226) [10:20:48] 07Puppet: Investigate usage of hiera_hash in our puppet repo - https://phabricator.wikimedia.org/T146621#2666483 (10yuvipanda) [10:20:51] 06Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#2666500 (10hashar) [10:20:54] 06Operations, 10Beta-Cluster-Infrastructure: Check status of under_NDA group - https://phabricator.wikimedia.org/T142822#2666497 (10hashar) 05Open>03Resolved a:03hashar I have removed the group. Every project members already had root access anyway. [10:22:11] 07Puppet, 06Labs, 10Labs-Infrastructure: Investigate usage of hiera_hash in our puppet repo - https://phabricator.wikimedia.org/T146621#2666501 (10Andrew) [10:22:55] oops forgot to say: back. [10:36:03] 07Puppet, 10Beta-Cluster-Infrastructure, 06Discovery, 10Wikimedia-Portals, 13Patch-For-Review: beta-mediawiki-config-update-eqiad failing with merge conflict in portals - https://phabricator.wikimedia.org/T129427#2666573 (10hashar) 05Open>03Resolved Havent seen this one happening again. I am assuming... [10:36:47] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2666578 (10hashar) [10:37:40] (03PS2) 10Hashar: beta: drop deployment-tin add deployment-tin02 [puppet] - 10https://gerrit.wikimedia.org/r/312654 (https://phabricator.wikimedia.org/T144006) [10:38:26] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2666597 (10hashar) The deployment servers have been reimaged to Jessie: * deployment-mira * deployment-tin02 Last patch to land is https://gerrit.wiki... [10:42:21] 06Operations, 10hardware-requests, 10netops, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Allocate contint1001 to releng and allocate to a vlan - https://phabricator.wikimedia.org/T140257#2666607 (10hashar) 05Open>03Resolved We had contint1001 allocated. It has a p... [10:48:42] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:50:11] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [10:50:58] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Make deployment-prep puppetmaster more similar to Production puppetmaster - https://phabricator.wikimedia.org/T146627#2666629 (10yuvipanda) [10:52:46] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Make deployment-prep puppetmaster more similar to Production puppetmaster - https://phabricator.wikimedia.org/T146627#2666647 (10yuvipanda) [10:53:20] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2666650 (10deryckchan) As someone without developer access, I gather from the discussion so far (over 7 years) that we're solving four different problems of various... [10:55:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [10:56:47] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#2666657 (10hashar) [10:56:50] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Make deployment-prep puppetmaster more similar to Production puppetmaster - https://phabricator.wikimedia.org/T146627#2666656 (10hashar) [10:57:07] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure: Make deployment-prep puppetmaster more similar to Production puppetmaster - https://phabricator.wikimedia.org/T146627#2666629 (10hashar) p:05Triage>03Normal [11:00:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:05:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:08:38] rats, no Jeff_Green for lutetium [11:10:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:13:33] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:15:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:16:11] sms sent [11:20:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:25:14] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:30:06] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:33:06] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:35:08] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:40:08] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:45:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:50:10] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:55:07] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [11:57:44] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:00:14] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [12:05:09] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [12:10:12] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [12:14:35] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2666887 (10Verdy_p) "3. Change the "traditional" MediaWiki interwiki prefix (not so important because Wikidata has made that mostly obsolete)" That's wrong, we ver... [12:15:11] PROBLEM - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [12:18:34] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:18:59] ACKNOWLEDGEMENT - check_mysql on lutetium is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) arielglenn will be investigated in a little while (msg from Jeff) [12:20:42] 06Operations, 10Datasets-General-or-Unknown: Reboot snapshot servers - https://phabricator.wikimedia.org/T146127#2666894 (10ArielGlenn) snapshot1001,5 done. The other two have jobs running on them that will finish up Tuesday. [12:29:52] (03CR) 10Hashar: "Yup the recursive strategy causes "git rebase" to always rebase even when the end result would be a noop :D" [puppet] - 10https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946) (owner: 10Hashar) [12:40:44] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 661 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3980966 keys - replication_delay is 661 [12:43:30] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:52:45] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:14] ^^ Phabricator cant reach the database [12:54:01] PROBLEM - https://phabricator.wikimedia.org on phab2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:54:07] looking [12:54:09] we're looking [12:54:13] oh good [12:58:39] and it's back :) [12:58:48] RECOVERY - https://phabricator.wikimedia.org on phab2001 is OK: HTTP OK: HTTP/1.1 200 OK - 27146 bytes in 0.227 second response time [13:00:08] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 27146 bytes in 0.217 second response time [13:00:17] magic [13:00:44] phew, TY! 
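The repeated lutetium alerts above all report the same thing: the probe cannot connect through /tmp/mysql.sock (errno 2). The production check_mysql is presumably the stock Nagios plugin; the following is only a minimal Python equivalent of that connectivity test, with the user and credentials file as assumptions.

```python
#!/usr/bin/env python3
"""Sketch of a check_mysql-style probe over the local socket, mirroring the
lutetium alert above; credentials handling here is an assumption."""
import os
import sys
import pymysql  # pip install pymysql

SOCKET = "/tmp/mysql.sock"  # path from the alert text

def main():
    try:
        conn = pymysql.connect(unix_socket=SOCKET, user="nagios",
                               read_default_file=os.path.expanduser("~/.my.cnf"),
                               connect_timeout=10)
    except pymysql.MySQLError as exc:
        print(f"CRITICAL: Cant connect to local MySQL server through socket "
              f"{SOCKET} ({exc.args[0]})")
        return 2
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'Uptime'")
        uptime = cur.fetchone()[1]
        cur.execute("SHOW SLAVE STATUS")   # empty result set if not a replica
        replica = cur.fetchone()
    conn.close()
    print(f"OK: Uptime: {uptime} Slave: {'Yes' if replica else 'No'}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```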
[13:01:18] I was looking but got nowhere near to located the issue before the recovery [13:09:05] apergos: I think jynus has a fair idea about a large delete causing a lock and so unavailability, he's giving #releng a heads up I believe [13:09:22] oh my [13:09:28] good to know, thanks [13:10:13] RECOVERY - check_mysql on lutetium is OK: Uptime: 1332 Threads: 1 Questions: 123396 Slow queries: 15 Opens: 8035 Flush tables: 2 Open tables: 64 Queries per second avg: 92.639 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [13:13:25] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:17:16] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:22:06] Phabricator seems to have DB issues again for me - "Can Not Connect to MySQL" [13:22:08] Phabricator can't connect to MySQL [13:22:11] jynus_: ^ [13:22:15] same as 20min ago [13:23:09] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:24:01] (03CR) 10Thcipriani: [C: 031] "very nice :)" [puppet] - 10https://gerrit.wikimedia.org/r/312748 (https://phabricator.wikimedia.org/T131946) (owner: 10Hashar) [13:24:26] PROBLEM - https://phabricator.wikimedia.org on phab2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:26:50] RECOVERY - https://phabricator.wikimedia.org on phab2001 is OK: HTTP OK: HTTP/1.1 200 OK - 27147 bytes in 4.656 second response time [13:28:01] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 27146 bytes in 0.201 second response time [13:29:00] maybe jynus you would peek at that a bit more? ^^ since it's flapping a little [13:29:23] hm not in here [13:29:58] 06Operations, 10Traffic: Remove "GeoIP lookup" service from https://status.wikimedia.org - https://phabricator.wikimedia.org/T146638#2666993 (10Aklapper) [13:30:24] apergos: it's dropping in and out based on an aria locking issue, jynus is going ot get into it and we'll regenarate the search indexes if needed [13:30:27] apergos: he's working on it (sitting next to me) [13:30:36] awesome, thank you [13:30:41] 06Operations, 06Performance-Team, 10Thumbor: Temp files not cleaned up on conversion error - https://phabricator.wikimedia.org/T146262#2667101 (10Gilles) [13:30:43] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2667032 (10deryckchan) @Verdy_p I agree with you. > - adding an interwiki and aliasing the former one, and checking that "#language:" correctly resolves both code... [13:30:51] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:38:39] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [13:42:19] The table is being converted to InnoDB, it is still running [13:43:55] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [13:44:06] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [13:44:56] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:49:36] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2667102 (10Arseny1992) [13:50:05] 06Operations, 10ops-eqiad, 10DBA: db1060: Degraded RAID - https://phabricator.wikimedia.org/T146449#2667177 (10Marostegui) thanks [13:50:53] The table is now converted [13:51:19] Searching works, but it doesn't give results. We are going to regenerate the search index [13:51:25] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [13:51:36] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [13:52:18] !log iridium phab ./bin/search index --all [13:52:22] !log phabricator is back in write mode - search is degraded. we are regenerating the indexes [13:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:47] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:01:33] 06Operations: sftp gives bogus "Couldn't stat remote file: No such file or directory" - https://phabricator.wikimedia.org/T146509#2663619 (10ArielGlenn) Is there a file of that name there already? If not, maybe you don't want 'resume'... [14:18:08] 06Operations, 10Monitoring, 06Performance-Team, 06Release-Engineering-Team, 07Wikimedia-Incident: MediaWiki load time regression should trigger an alarm / page people - https://phabricator.wikimedia.org/T146125#2651529 (10ori) #performance-team is considering making this the focus of our off-site. [14:21:42] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: 502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service - https://phabricator.wikimedia.org/T146576#2667326 (10Esc3300) Thanks for fixing this. Afterwards, occasionally, I got old data and the "data updated"... [14:24:07] jynus_: marostegui: well done :] [14:24:21] heavy search and MyISAM aren't playing nice are they? [14:34:59] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666339 (10AlexMonk-WMF) >>! In T146619#2666465, @Bawolff wrote: > Stupid question - if its just about cost of certs, can't we use LetsEncrypt? It's not that stupid. Let's Encrypt... [14:42:51] 06Operations, 10DNS, 10Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2667400 (10ArielGlenn) I understand there is a 100 cert limit for Let's Encrypt. Looking at this: https://letsencrypt.org/docs/rate-limits/ it's not clear to me the exact limits. 
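The Phabricator fix described above was to convert the table behind search from Aria/MyISAM to InnoDB and then rebuild the index with `./bin/search index --all`. The sketch below strings those two steps together; the schema name, credentials file and Phabricator install path are all assumptions — only the reindex command is the one logged above.

```python
#!/usr/bin/env python3
"""Sketch of the remediation described above: convert any MyISAM/Aria tables
in the Phabricator search schema to InnoDB, then rebuild the search index."""
import subprocess
import pymysql  # pip install pymysql

DB = "phabricator_search"        # hypothetical schema name
PHAB = "/srv/phab/phabricator"   # hypothetical install path

def main():
    conn = pymysql.connect(read_default_file="/root/.my.cnf",
                           database="information_schema")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT table_name FROM tables "
            "WHERE table_schema = %s AND engine IN ('MyISAM', 'Aria')", (DB,))
        tables = [row[0] for row in cur.fetchall()]
        for t in tables:
            print(f"Converting {DB}.{t} to InnoDB ...")
            cur.execute(f"ALTER TABLE `{DB}`.`{t}` ENGINE=InnoDB")
    conn.close()
    if tables:
        # Rebuild the search index, as was done on iridium above
        # ("search is degraded" until this finishes).
        subprocess.run([f"{PHAB}/bin/search", "index", "--all"], check=True)

if __name__ == "__main__":
    main()
```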
[14:44:42] (03PS3) 10EBernhardson: Update mwdeploy group sudo rights for jessie [puppet] - 10https://gerrit.wikimedia.org/r/312705 [14:44:49] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2667404 (10Krenair) >>! In T21986#2666650, @deryckchan wrote: > As someone without developer access, I gather from the discussion so far (over 7 years) that we're s... [14:44:55] (03CR) 10EBernhardson: "seems there isn't any harm in having both, updated." [puppet] - 10https://gerrit.wikimedia.org/r/312705 (owner: 10EBernhardson) [14:48:56] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#2667411 (10yuvipanda) I'm no longer convinced we have to do this. https://phabricator.wikimedia.org/T91990 should cover most of the things we want from this. [14:51:06] (03CR) 10Thcipriani: [C: 031] "Looks like everywhere that this change needs to be made." [puppet] - 10https://gerrit.wikimedia.org/r/312654 (https://phabricator.wikimedia.org/T144006) (owner: 10Hashar) [15:17:07] (03CR) 10Thcipriani: [C: 031] "lgtm. Just created a task to remove the hard-coded upstart commands @hashar found in scap: T146656" [puppet] - 10https://gerrit.wikimedia.org/r/312705 (owner: 10EBernhardson) [15:22:59] (03PS4) 10EBernhardson: Update mwdeploy group sudo rights for jessie [puppet] - 10https://gerrit.wikimedia.org/r/312705 (https://phabricator.wikimedia.org/T146656) [15:23:30] (03CR) 10Hashar: [C: 031] "Neat thank you Erik. Tyler filled T146656 to track the removal of the old commands." [puppet] - 10https://gerrit.wikimedia.org/r/312705 (https://phabricator.wikimedia.org/T146656) (owner: 10EBernhardson) [15:25:43] 06Operations, 06Performance-Team, 10Thumbor: djvu failure for very high page number - https://phabricator.wikimedia.org/T145616#2667483 (10Gilles) [15:25:46] 06Operations, 06Performance-Team, 10Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2667484 (10Gilles) [15:25:49] 06Operations, 06Performance-Team, 10Thumbor: djvu failure for very high page number - https://phabricator.wikimedia.org/T145616#2636021 (10Gilles) [15:25:52] 06Operations, 06Performance-Team, 10Thumbor: thumbor ffmpeg pipe deadlock - https://phabricator.wikimedia.org/T145626#2667487 (10Gilles) [15:25:55] 06Operations, 06Performance-Team, 10Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2635959 (10Gilles) [15:25:58] 06Operations, 06Performance-Team, 10Thumbor: Temp files not cleaned up on conversion error - https://phabricator.wikimedia.org/T146262#2667490 (10Gilles) [15:26:00] (03PS3) 10MarcoAurelio: Initial configuration for olo.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) [15:26:01] 06Operations, 06Performance-Team, 10Thumbor: Make the 100MB+ test files downloaded from their source instead of being in the git repo - https://phabricator.wikimedia.org/T145785#2667488 (10Gilles) [15:26:04] 06Operations, 06Performance-Team, 10Thumbor: Thumbor can't load source files bigger than 100MB - https://phabricator.wikimedia.org/T145768#2667489 (10Gilles) [15:44:30] I"m here but for te next while if someone needs me they should ping, I won't be paying close attention [15:53:20] 06Operations, 10Recommendation-API: Backport python3-sklearn and python3-sklearn-lib from sid - https://phabricator.wikimedia.org/T133362#2667567 (10ori) 
05Open>03declined >>! In T133362#2575978, @yuvipanda wrote: > I also think deb packaging for this is going town a long, unrecoverable rabbit hole, and wo... [15:56:28] (03PS1) 10Urbanecm: [throttle] Rule for Winona State University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312842 (https://phabricator.wikimedia.org/T146600) [16:00:25] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 4 others: Redo /beacon/impression system (formerly Special:RecordImpression) to remove extra round trips on all FR impressions (title was: S:RI should pyroperish) - https://phabricator.wikimedia.org/T45250#2667598 (10N... [16:03:41] 06Operations: create notifications about user accounts that have not been used for a long time - https://phabricator.wikimedia.org/T146657#2667606 (10Dzahn) [16:36:45] 06Operations, 10ORES: Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#2667780 (10GWicke) [16:37:04] 06Operations, 10ORES, 06Services: Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#2667792 (10GWicke) [16:55:27] (03PS1) 10Urbanecm: [throttle] Ada Lovelave Day Edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312852 (https://phabricator.wikimedia.org/T146654) [16:56:04] (03CR) 10jenkins-bot: [V: 04-1] [throttle] Ada Lovelave Day Edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312852 (https://phabricator.wikimedia.org/T146654) (owner: 10Urbanecm) [16:56:22] PROBLEM - puppet last run on mw1274 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:40] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 4 others: Redo /beacon/impression system (formerly Special:RecordImpression) to remove extra round trips on all FR impressions (title was: S:RI should pyroperish) - https://phabricator.wikimedia.org/T45250#2667997 (10a... [17:21:23] RECOVERY - puppet last run on mw1274 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:22:42] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api [17:25:05] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:30:32] citoid. sigh [17:30:34] nothing full [17:35:15] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:37:36] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [17:39:41] "recovered" ok but without my intervention [17:45:12] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api [17:46:07] and there we are again [17:56:53] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:57:33] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [18:01:49] the local tests all pan out fine, I don't see anything newly bad in the logs, etc [18:02:33] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:02:38] eyeroll [18:02:40] thanks [18:10:14] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:15:05] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:15:28] groan [18:15:38] and as before local tests return results [18:15:54] how are you running them, and where? [18:16:11] * ori is kibitzing [18:17:33] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [18:17:34] thank please do, because I am getting nowhere [18:17:45] run on scb1001,2 [18:18:00] https://wikitech.wikimedia.org/wiki/Citoid under the 'testing' section [18:21:43] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:22:43] I see entries ticking by on sca1001 in the zotero log so I really dunno [18:23:59] request latency is very uneven [18:24:04] the check that issues this alert is check_wmf_service!http://citoid.svc.codfw.wmnet:1970!15 [18:24:21] check_wmf_service is /usr/bin/service-checker-swagger -t $ARG2$ $HOSTNAME$ $ARG1$ [18:24:35] so the invocation is: /usr/bin/service-checker-swagger -t 15 localhost 'http://citoid.svc.codfw.wmnet:1970' [18:24:41] (assuming you are on scb1001,2) [18:24:57] well I can try the icinga check certainly [18:25:03] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api [18:25:30] if I do: while true; do time /usr/bin/service-checker-swagger -t 15 localhost 'http://citoid.svc.codfw.wmnet:1970' ; sleep 1 ; done [18:25:45] sometimes I get 'All endpoints are healthy' in a few seconds, sometimes it takes over a minute [18:25:49] at least once I got a timeout [18:26:03] codfw? [18:26:18] these should be for eqiad [18:26:27] hah, yes, that's wrong. but it's hitting localhost, so the vhost name shouldn't matter much [18:27:16] my check just told me healthy [18:27:17] meh [18:27:24] run it in a loop [18:27:30] yeah I'm about to run it a few times [18:28:46] for i in `seq 20`... [18:28:55] and a 5 sec sleep, let's see what happens [18:29:11] or is that "you won't believe what happens next" :-P [18:29:16] service-checker-swagger is a python util written by _joe_ apparently: https://github.com/lavagetto/service-checker/blob/master/checker/service.py . i'd live-hack it to print latency for each request it makes. it uses the spec to make multiple reqs [18:29:48] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:30:20] two healthies and a timeout so far [18:30:29] !sal [18:30:29] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. 
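ori's suggestion above is to make the checker print latency per request. The same information can be gathered from the outside with a small standalone script that repeatedly times the /api endpoint named in the alert; the URL and the 10-second timeout are taken from the alert text, and 20 runs with a 1-second sleep mirror the loop quoted above.

```python
#!/usr/bin/env python3
"""Sketch of the 'print latency for each request' idea above, done from the
outside rather than by live-hacking servicechecker/swagger.py: repeatedly time
the /api endpoint the LVS check keeps flagging and report each request's
duration."""
import time
import urllib.error
import urllib.request

URL = "http://citoid.svc.eqiad.wmnet:1970/api"
TIMEOUT = 10  # seconds, same as the icinga socket timeout above
RUNS = 20

def main():
    for i in range(1, RUNS + 1):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(URL, timeout=TIMEOUT) as resp:
                status = resp.status
        except urllib.error.URLError as exc:
            status = f"error: {exc.reason}"
        except OSError as exc:  # socket.timeout and friends
            status = f"error: {exc}"
        print(f"{i:2d}  {time.monotonic() - start:6.2f}s  {status}")
        time.sleep(1)

if __name__ == "__main__":
    main()
```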
[18:31:56] it didn't help me here, hashar :-P [18:32:23] ah that was for me [18:32:27] ah ha :-D [18:32:33] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [18:32:38] wanted to check whether someone noticed logrotate failing on fluorine [18:32:55] well not failling [18:33:02] just logrotated later than I would have expected [18:37:03] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:37:35] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:39:17] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [18:39:56] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:41:47] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [18:43:57] six timeouts out of twenty [18:44:19] ganglia doesnt' show any big changes in load on sca* or scb* over the last 4 hours [18:47:26] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:47] mind if I live-hack /usr/lib/python2.7/dist-packages/servicechecker/swagger.py to print latency and URL for each request? [18:48:49] apergos: ^ [18:48:51] on scb1001 [18:48:51] hashar: seeing as I'm getting nowhere on the flapping citoid, what log rot were you looking for in particular on fluorine? [18:49:02] ori: go ahead, please make a copy of the orig in the same place [18:49:06] yep [18:49:19] I mean it's certainly well over the 10 sec reply that icinga wants, it seems [18:49:25] .py -> .bak [18:49:30] cool [18:49:52] apergos: /a/mw-logs/api.log which has its first entry at roughly 8:40 [18:49:59] would have expected 6:45 or so [18:50:03] ah [18:50:05] lemme see [18:51:29] the api-feature log has messages starting at 6:45 [18:51:32] apergos: it is not important really [18:51:48] so general log rot went off on time as usual [18:52:07] actually [18:52:14] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [18:52:18] we might want to drop api.log entirely [18:52:23] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:52:24] I am not sure whether it has a purpose anymore [18:53:06] but many many of these logs appear to have starting times of 08:45 or so [18:53:18] as though one restart took a very long time [18:53:33] that log every single requests made to the api [18:53:55] sometimes it takes very long to gzip [18:54:22] maybe we can logrotate it on an hourly basis [18:54:33] and look at stopping those logs [18:55:28] I see: [18:55:35] there are a number of logs in here of course [18:55:41] some take a little while to do the rot/gzip [18:55:43] that time adds up [18:56:18] it's the /api?search=http%3A%2F%2Fexample.com&format=bibtex request [18:56:21] (re: citoid) [18:56:22] the api log is 25 down in the list or so [18:56:44] apergos: dont waste your time on the logrotation :] [18:57:09] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:57:11] centralauthrename takes 20 minutes all by itself [18:57:15] so that's your answer [18:57:24] no time wasted, but no fix either :-D [18:57:44] ori: wait what? 
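On the fluorine logrotate question: one way to see why api.log only starts at ~8:40 is to look at when each freshly rotated .gz finished being written — a slow gzip (centralauthrename is quoted above at ~20 minutes) pushes every log rotated after it later. The directory and filename pattern below are assumptions based on the path quoted in the conversation.

```python
#!/usr/bin/env python3
"""Sketch: reconstruct how long the nightly logrotate run took by sorting the
rotated .gz files by mtime and printing each file's offset from the first one."""
import glob
import os
from datetime import datetime

LOGDIR = "/a/mw-logs"  # path as quoted above; filename pattern is a guess

def main():
    rotated = sorted(glob.glob(os.path.join(LOGDIR, "*.log-*.gz")),
                     key=os.path.getmtime)
    if not rotated:
        print("no rotated logs found")
        return
    t0 = os.path.getmtime(rotated[0])
    for path in rotated:
        done = os.path.getmtime(path)
        print(f"+{(done - t0) / 60:6.1f} min  "
              f"{datetime.fromtimestamp(done):%H:%M:%S}  {os.path.basename(path)}")

if __name__ == "__main__":
    main()
```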
[18:58:35] I mean I see that's the one with whining about the timeout [18:58:39] but what is that even [18:58:41] the checker checks multiple endpoints: /, /_info, and /api. It's the latter one [18:58:42] example.com? [18:59:02] oh the search.... [18:59:11] groan [18:59:29] now officially and completely out of my depth :-P [18:59:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [19:01:48] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:07:04] I would expect the checker to use one of the examples in here: https://github.com/zotero/translators or am I completely off base? [19:14:56] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2668421 (10awight) I'm digging a rabbithole through the cache vaults, and found this interesting resul... [19:18:12] akosiaris, if/when you are around would you mind having a look in on the citoid issue? (see scrollback) [19:20:57] I'm officially off unless there's an emergency site-wise [19:21:04] of course so is the rest of ops :-/ [19:21:15] once a year, that's how it is [19:29:05] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:34:54] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:34:54] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:37:10] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [19:37:10] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [19:40:27] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [19:45:55] (03PS1) 10Catrope: Follow-up fd8998a4ec9: remove another stray $wmgMFUseCentralAuthToken reference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312867 [19:46:04] jdlrobson: ---^^ [19:48:44] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 38573 MB (3% inode=99%) [19:51:39] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2668518 (10awight) I'm using something closer to the original command for probing the message cache, a... [19:52:44] (03CR) 10Hashar: [C: 031] "Looks fun :] Filippo would Prometheus be able to replace Shinken/Icinga? Alarming on a graphite metric is a typical use case for us (both" [puppet] - 10https://gerrit.wikimedia.org/r/304263 (https://phabricator.wikimedia.org/T141785) (owner: 10Thcipriani) [19:53:55] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:58:20] (03PS1) 10Kaldari: Deploying PageAssessments to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312869 (https://phabricator.wikimedia.org/T146679) [20:05:24] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:07:54] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:13:10] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2668547 (10spatton) @awight our typical practice is to disable a given campaign before we swap new ban... [20:35:21] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:54:44] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:55:09] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 28806 seconds ago, expected 28800 [21:00:09] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 29106 seconds ago, expected 28800 [21:05:09] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 29406 seconds ago, expected 28800 [21:07:55] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2668865 (10awight) @spatton I've changed some things about the backend in order to help diagnose this... [21:10:09] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 29706 seconds ago, expected 28800 [21:15:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 30006 seconds ago, expected 28800 [21:16:12] PROBLEM - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 38732 MB (3% inode=99%) [21:20:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 30306 seconds ago, expected 28800 [21:22:34] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:25:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 30606 seconds ago, expected 28800 [21:30:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 30906 seconds ago, expected 28800 [21:35:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 31206 seconds ago, expected 28800 [21:40:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 31506 seconds ago, expected 28800 [21:45:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 31806 seconds ago, expected 28800 [21:50:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 32106 seconds ago, expected 28800 [21:55:13] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 32406 seconds ago, expected 28800 [22:00:13] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 32706 seconds ago, expected 28800 [22:05:13] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 33007 seconds ago, expected 28800 [22:10:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 33306 seconds ago, expected 28800 [22:15:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 33606 seconds ago, expected 28800 [22:18:01] (03PS5) 10Hoo man: More error logging/ sanity checks for dumpwikidata [puppet] - 10https://gerrit.wikimedia.org/r/311551 [22:19:09] (03CR) 10Hoo man: [C: 031] "Fixed a few minor things. Manually verified this by dumping testwikidatawiki to /tmp on snapshot1007." 
[puppet] - 10https://gerrit.wikimedia.org/r/311551 (owner: 10Hoo man) [22:20:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 33907 seconds ago, expected 28800 [22:24:58] 06Operations, 10Mail, 07LDAP, 13Patch-For-Review: Add yubikey attribute to production ldap - https://phabricator.wikimedia.org/T146102#2669082 (10bbogaert) Hi, I made the change to corp LDAP. I have been able to add the wikimediaPerson objectClass and YubiKeyVPN attribute to myself. Can you check if LDAP... [22:25:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 34207 seconds ago, expected 28800 [22:27:11] (03PS1) 10Jon Harald Søby: Adding language name configuration for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312944 (https://phabricator.wikimedia.org/T146707) [22:30:06] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 34506 seconds ago, expected 28800 [22:31:06] (sorry for the ping, sir) [22:35:13] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 34807 seconds ago, expected 28800 [22:36:59] those are annoying [22:40:13] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 35106 seconds ago, expected 28800 [22:45:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 35406 seconds ago, expected 28800 [22:50:08] (03PS1) 10Thcipriani: Fix failing keyholder arming check [puppet] - 10https://gerrit.wikimedia.org/r/312947 [22:50:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 35707 seconds ago, expected 28800 [22:55:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 36006 seconds ago, expected 28800 [23:00:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 36306 seconds ago, expected 28800 [23:05:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 36606 seconds ago, expected 28800 [23:10:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 36906 seconds ago, expected 28800 [23:10:24] we get it [23:11:44] heh [23:12:01] I found this amusing: "Puppet last ran 28806 seconds ago, expected 28800" [23:12:21] :p [23:12:38] so close. [23:15:08] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 37206 seconds ago, expected 28800 [23:18:49] PROBLEM - very high load average likely xfs on ms-be1002 is CRITICAL: CRITICAL - load average: 100.07, 100.55, 99.10 [23:19:00] now that's not nothing ^ [23:20:08] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 37506 seconds ago, expected 28800 [23:21:08] (03CR) 10Mattflaschen: [C: 04-1] "Although flow_computed/flow-computed could be used at runtime, it intentionally is not, for performance reasons." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309186 (owner: 10Dereckson) [23:22:16] (03CR) 10Mattflaschen: "Also, you moved it out of the dblist directory. It should stay there." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/309186 (owner: 10Dereckson) [23:25:10] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 37806 seconds ago, expected 28800 [23:29:42] 06Operations, 06Revision-Scoring-As-A-Service: halfak should get emails when ores.wikimedia.org goes down - https://phabricator.wikimedia.org/T146720#2669336 (10Halfak) [23:30:10] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 38106 seconds ago, expected 28800 [23:30:19] Hey folks, I'm looking for help getting paging (email) from icinga for ores.wikimedia.org. [23:30:20] See https://phabricator.wikimedia.org/T146720 [23:30:34] 06Operations, 10Icinga, 06Revision-Scoring-As-A-Service: halfak should get emails when ores.wikimedia.org goes down - https://phabricator.wikimedia.org/T146720#2669349 (10Halfak) [23:30:39] halfak: offsite this week in Spain, they're probably all off drinking/sleeping [23:30:48] Oh yeah. Thanks. [23:30:50] :) [23:30:59] but, +1 to the request :) [23:31:00] But then again, this is helping them fill the gap of awayness. [23:31:16] Hmm... maybe I can wake up really early tomorrow and call in a favor [23:31:39] halfak: I replied just to Amir re the incident, should have adde dyou, but: a quick/short incident report with follow-ups (like that) would be good to have [23:31:40] I just realized I get pings when ores.wmflabs.org goes down -- but not prod! [23:31:52] greg-g, in progress! [23:32:05] be still my heart [23:32:10] :) [23:33:27] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2669351 (10awight) p:05Unbreak!>03High Reducing the priority because we're not actively losing ban... [23:35:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 38406 seconds ago, expected 28800 [23:40:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 38706 seconds ago, expected 28800 [23:40:25] (03CR) 10Alex Monk: "more details about the bug in the commit message of I815ae9e5" [puppet] - 10https://gerrit.wikimedia.org/r/312947 (owner: 10Thcipriani) [23:41:32] PROBLEM - very high load average likely xfs on ms-be1002 is CRITICAL: CRITICAL - load average: 102.87, 101.37, 99.52 [23:43:32] (03CR) 10Alex Monk: [C: 031] "maybe clarify the commit message, but the code appears to work" [puppet] - 10https://gerrit.wikimedia.org/r/312947 (owner: 10Thcipriani) [23:45:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 39007 seconds ago, expected 28800 [23:50:12] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 39306 seconds ago, expected 28800 [23:55:13] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 39607 seconds ago, expected 28800 [23:57:13] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
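The barium alerts above come from a puppet freshness check with a 28800-second (8-hour) threshold, hence the near miss at 28806 seconds noted a bit earlier in the log. A minimal sketch of that kind of check follows; reading last_run_summary.yaml is the usual way to get the last-run time on an agent, but it is an assumption that the frack check_puppetrun works exactly this way.

```python
#!/usr/bin/env python3
"""Sketch of a check_puppetrun-style freshness check: compare the agent's
last-run timestamp against the 28800-second threshold seen in the barium
alerts above."""
import sys
import time
import yaml  # pip install pyyaml

SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"
MAX_AGE = 28800  # seconds, the 'expected' value in the alerts

def main():
    with open(SUMMARY) as f:
        summary = yaml.safe_load(f)
    last_run = summary.get("time", {}).get("last_run", 0)
    age = int(time.time() - last_run)
    if age > MAX_AGE:
        print(f"CRITICAL: Puppet last ran {age} seconds ago, expected {MAX_AGE}")
        return 2
    print(f"OK: Puppet last ran {age} seconds ago")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```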