[00:00:14] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 39907 seconds ago, expected 28800
[00:04:05] <icinga-wm_>	 PROBLEM - very high load average likely xfs on ms-be1002 is CRITICAL: CRITICAL - load average: 100.17, 100.14, 99.97
[00:05:14] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 40206 seconds ago, expected 28800
[00:10:14] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 40506 seconds ago, expected 28800
[00:15:14] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 40807 seconds ago, expected 28800
[00:20:14] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 41106 seconds ago, expected 28800
[00:22:14] <icinga-wm_>	 RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[00:25:14] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 41406 seconds ago, expected 28800
[00:29:08] <ori>	 jesus christ that barium alert
[00:29:21] <ori>	 it's a fundraising cluster host
[00:30:15] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 41706 seconds ago, expected 28800
[00:35:16] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 42006 seconds ago, expected 28800
[00:40:16] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 42307 seconds ago, expected 28800
[00:44:47] <icinga-wm_>	 PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:45:07] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 42606 seconds ago, expected 28800
[00:50:17] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 42906 seconds ago, expected 28800
[00:55:07] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 43206 seconds ago, expected 28800
[01:00:17] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 43506 seconds ago, expected 28800
[01:05:11] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 43806 seconds ago, expected 28800
[01:10:02] <icinga-wm_>	 RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:10:11] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 44106 seconds ago, expected 28800
[01:12:52] <icinga-wm_>	 PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:15:11] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 44406 seconds ago, expected 28800
[01:20:11] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 44706 seconds ago, expected 28800
[01:23:01] <icinga-wm_>	 PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:25:13] <icinga-wm_>	 PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 45006 seconds ago, expected 28800
[01:30:16] <icinga-wm_>	 RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 133 seconds ago with 0 failures
[01:40:30] <icinga-wm_>	 RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:48:10] <icinga-wm_>	 RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:35:26] <icinga-wm_>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[02:42:46] <icinga-wm_>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[03:12:47] <icinga-wm_>	 PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:30:02] <icinga-wm_>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3883515 keys - replication_delay is 0
[03:39:57] <icinga-wm_>	 RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:49:57] <grrrit-wm>	 (CR) Legoktm: "T146619 was filed about .org no longer redirecting." [dns] - https://gerrit.wikimedia.org/r/244092 (owner: Dzahn)
[03:57:17] <icinga-wm_>	 PROBLEM - Disk space on ms-be2009 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdd1 is not accessible: Input/output error
[03:58:36] <icinga-wm_>	 PROBLEM - MegaRAID on ms-be2009 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[04:02:27] <icinga-wm_>	 RECOVERY - Disk space on ms-be2009 is OK: DISK OK
[04:19:41] <AaronSchulz>	 legoktm: https://gerrit.wikimedia.org/r/#/c/312945/
[04:19:55] <legoktm>	 hi AaronSchulz :)
[04:20:32] <legoktm>	 AaronSchulz: you should do that in Wikibase too ;)
[04:21:05] <AaronSchulz>	 me hellos are of the form <PREAMBLE><ID>
[04:21:29] <AaronSchulz>	 PREAMBLE= https://gerrit.wikimedia.org/r/#/c/; ID=\d+
[04:22:01] <legoktm>	 :P
[04:22:58] <AaronSchulz>	 I CC'd you on a class_exists() thread
[04:22:59] <icinga-wm_>	 PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdd1]
[04:23:10] <AaronSchulz>	 could be useful for DatabaseBase
[04:25:29] <legoktm>	 AaronSchulz: why class_alias over class DatabaseBase extends Database {} ?
[04:25:40] <legoktm>	 (or the other way around)
[04:25:56] <legoktm>	 also, we're in the wrong channel ;)
[04:41:59] <grrrit-wm>	 (CR) 20after4: [C: 1] Fix failing keyholder arming check [puppet] - https://gerrit.wikimedia.org/r/312947 (owner: Thcipriani)
[04:45:40] <icinga-wm_>	 PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:10:38] <icinga-wm_>	 RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[05:32:13] <_joe_>	 !log rebooting ms-be1002, stuck in a failed disk
[05:32:18] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:37:40] <icinga-wm_>	 PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100%
[05:40:44] <icinga-wm_>	 ACKNOWLEDGEMENT - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Failing vd-11
[06:35:41] <icinga-wm_>	 PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tree]
[07:00:11] <icinga-wm_>	 RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:21:46] <grrrit-wm>	 (PS1) 20after4: Move scap package version to class parameter [puppet] - https://gerrit.wikimedia.org/r/312971
[07:27:07] <grrrit-wm>	 (PS2) 20after4: Move scap package version to class parameter [puppet] - https://gerrit.wikimedia.org/r/312971 (https://phabricator.wikimedia.org/T146618)
[07:47:54] <wikibugs>	 Operations, MediaWiki-Stakeholders-Group, Traffic, Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2669818 (BBlack)
[07:47:57] <wikibugs>	 Operations, Traffic: Remove "GeoIP lookup" service from https://status.wikimedia.org - https://phabricator.wikimedia.org/T146638#2669815 (BBlack) Open>Resolved a:BBlack Thanks for finding this, should've removed before service decom.  It's gone now.
[07:59:14] <wikibugs>	 Operations, Mail, LDAP, Patch-For-Review: Add yubikey attribute to production ldap - https://phabricator.wikimedia.org/T146102#2669822 (MoritzMuehlenhoff) Open>Resolved @bbogaert : Confirmed, the users awight, bboegart and bbunny have the attribute in the corp LDAP mirror.
[08:07:53] <wikibugs>	 Operations, Performance-Team, Thumbor: Figure out a way to live-debug running production thumbor processes - https://phabricator.wikimedia.org/T146143#2669825 (Gilles)
[08:08:22] <icinga-wm_>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[08:12:06] <grrrit-wm>	 (PS6) ArielGlenn: More error logging/ sanity checks for dumpwikidata [puppet] - https://gerrit.wikimedia.org/r/311551 (owner: Hoo man)
[08:13:13] <icinga-wm_>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[08:13:40] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] More error logging/ sanity checks for dumpwikidata [puppet] - https://gerrit.wikimedia.org/r/311551 (owner: Hoo man)
[08:18:13] <icinga-wm_>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[08:22:28] <grrrit-wm>	 (CR) Muehlenhoff: [C: 1] "Looks good to me. I can merge when I'm back from the offsite unless anyone beats me to it earlier." [puppet] - https://gerrit.wikimedia.org/r/312947 (owner: Thcipriani)
[08:23:13] <icinga-wm_>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[08:25:38] <grrrit-wm>	 (CR) Alexandros Kosiaris: Allow extra arguments to be passed to compile function (1 comment) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312516 (owner: Alexandros Kosiaris)
[08:26:13] <icinga-wm_>	 RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[08:27:04] <icinga-wm_>	 RECOVERY - very high load average likely xfs on ms-be1002 is OK: OK - load average: 26.61, 7.68, 2.65
[08:27:50] <wikibugs>	 Operations, Puppet, Goal, Puppet-infrastructure-modernization: Goal: Modernize puppet configuration management infrastructure - https://phabricator.wikimedia.org/T139471#2669840 (Joe)
[08:27:52] <wikibugs>	 Operations, Puppet, Goal, Patch-For-Review, Puppet-infrastructure-modernization: Set up a puppet frontend in codfw who can work as a slave of eqiad's master - https://phabricator.wikimedia.org/T143869#2669839 (Joe) Open>Resolved
[08:27:54] <icinga-wm_>	 RECOVERY - MegaRAID on ms-be1002 is OK: OK: optimal, 14 logical, 14 physical
[08:28:45] <grrrit-wm>	 (PS4) Alexandros Kosiaris: Allow extra arguments to be passed to compile function [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312516
[08:29:17] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Allow extra arguments to be passed to compile function [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312516 (owner: Alexandros Kosiaris)
[08:32:41] <grrrit-wm>	 (PS5) Alexandros Kosiaris: Allow extra arguments to be passed to compile function [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312516
[08:38:33] <wikibugs>	 Operations, ops-eqiad: ms-be1002.eqiad.wmnet: slot=11 dev=sdl failed - https://phabricator.wikimedia.org/T146741#2669846 (fgiunchedi) NEW
[08:38:40] <wikibugs>	 Operations, Icinga, Revision-Scoring-As-A-Service: halfak should get emails when ores.wikimedia.org goes down - https://phabricator.wikimedia.org/T146720#2669336 (ArielGlenn) modules/icinga/manifests/monitor/ores.pp already lists team-ores as the contact group for the service checks, just like ores_l...
[08:38:49] <wikibugs>	 Operations, Icinga, Revision-Scoring-As-A-Service: halfak should get emails when ores.wikimedia.org goes down - https://phabricator.wikimedia.org/T146720#2669854 (ArielGlenn) p:Triage>Normal
[08:39:05] <wikibugs>	 Operations, Patch-For-Review: Re-add intel-microcode - https://phabricator.wikimedia.org/T127825#2669855 (BBlack) +1 on tying this to 4.4+ and it being a good idea to get it going again.  I'm not sure (unless someone's investigated already) that our latest 3.1[69] and such kernels don't still have the in...
[08:39:25] <icinga-wm_>	 PROBLEM - puppet last run on mw1236 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:47:41] <wikibugs>	 Operations, ops-eqiad: ms-be1002.eqiad.wmnet: slot=11 dev=sdl failed - https://phabricator.wikimedia.org/T146741#2669896 (fgiunchedi) puppet chokes while running xfs_admin, and gets stuck in status D. Also smartctl can read basic info but not return attributes, it looks like to me the disk is dead. I've...
[08:50:23] <grrrit-wm>	 (PS6) Alexandros Kosiaris: Enable future parser based on .future file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312517
[08:50:57] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Enable future parser based on .future file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312517 (owner: Alexandros Kosiaris)
[08:56:16] <icinga-wm_>	 PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:24] <icinga-wm_>	 PROBLEM - Host lutetium is DOWN: CRITICAL - Network Unreachable (208.80.155.13)
[09:05:34] <icinga-wm_>	 PROBLEM - Host frqueue1002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:06:01] <apergos>	 Jeff_Green: ?
[09:06:31] <Jeff_Green>	 looking...
[09:06:39] <wikibugs>	 Operations, Wikimedia-Mailing-lists: deactivate maint-announce - https://phabricator.wikimedia.org/T143760#2669921 (Dzahn) As discussed during ops offsite, we'll keep using RT for now and also investigate the possibility of a shared inbox, but won't go back to the mailing list.
[09:06:49] <wikibugs>	 Operations, Wikimedia-Mailing-lists: deactivate maint-announce - https://phabricator.wikimedia.org/T143760#2669922 (Dzahn) Open>Resolved
[09:06:54] <icinga-wm_>	 PROBLEM - Host pfw-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.218)
[09:07:16] <icinga-wm_>	 RECOVERY - Host frqueue1002 is UP: PING WARNING - Packet loss = 58%, RTA = 1.54 ms
[09:07:25] <icinga-wm_>	 RECOVERY - Host lutetium is UP: PING OK - Packet loss = 0%, RTA = 8.23 ms
[09:07:29] <Jeff_Green>	 ahhh, firewall.. that explains it
[09:07:34] <icinga-wm_>	 RECOVERY - Host pfw-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[09:07:34] <icinga-wm_>	 RECOVERY - puppet last run on mw1236 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:09:15] <grrrit-wm>	 (PS4) MarcoAurelio: Initial configuration for olo.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612)
[09:10:40] <grrrit-wm>	 (CR) MarcoAurelio: "I'd suggest to upload the logo in a separate patch, so it can be optiPNG'd. I've set the logo path here though." [mediawiki-config] - https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[09:16:46] <icinga-wm_>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[09:19:26] <icinga-wm_>	 RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[09:29:15] <icinga-wm_>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[09:32:11] <wikibugs>	 Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2669983 (Dzahn)
[09:32:45] <grrrit-wm>	 (CR) Alexandros Kosiaris: Enable future parser based on .future file (2 comments) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312517 (owner: Alexandros Kosiaris)
[09:35:59] <grrrit-wm>	 (PS7) Alexandros Kosiaris: Enable future parser based on .future file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312517
[09:36:34] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Enable future parser based on .future file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312517 (owner: Alexandros Kosiaris)
[09:39:05] <grrrit-wm>	 (PS8) Alexandros Kosiaris: Enable future parser based on .future file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312517
[09:40:22] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2] Allow extra arguments to be passed to compile function [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312516 (owner: Alexandros Kosiaris)
[09:40:55] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2] Enable future parser based on .future file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312517 (owner: Alexandros Kosiaris)
[09:41:56] <icinga-wm_>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[09:46:57] <icinga-wm_>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[09:51:26] <grrrit-wm>	 (PS1) Urbanecm: Upload 1x logo for olowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/312977 (https://phabricator.wikimedia.org/T146745)
[09:52:25] <grrrit-wm>	 (PS1) Alexandros Kosiaris: puppet_compiler: Fix breakage caused by I76610eb72 [puppet] - https://gerrit.wikimedia.org/r/312978
[09:54:10] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] puppet_compiler: Fix breakage caused by I76610eb72 [puppet] - https://gerrit.wikimedia.org/r/312978 (owner: Alexandros Kosiaris)
[09:56:01] <grrrit-wm>	 (PS2) Alexandros Kosiaris: puppet_compiler: Fix breakage caused by I76610eb72 [puppet] - https://gerrit.wikimedia.org/r/312978
[09:57:54] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 2] puppet_compiler: Fix breakage caused by I76610eb72 [puppet] - https://gerrit.wikimedia.org/r/312978 (owner: Alexandros Kosiaris)
[10:02:19] <icinga-wm_>	 PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 34 probes of 422 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[10:08:42] <icinga-wm_>	 RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 422 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[10:09:39] <icinga-wm_>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[10:17:09] <icinga-wm_>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[10:29:20] <grrrit-wm>	 (PS1) Alexandros Kosiaris: Revert "Stabilize the output of stdlib's keys function" [puppet] - https://gerrit.wikimedia.org/r/312984
[10:30:54] <grrrit-wm>	 (PS1) Ottomata: Add gbp.conf [debs/python-snakebite] - https://gerrit.wikimedia.org/r/312985
[10:31:18] <wikibugs>	 Operations, Performance-Team, Thumbor: Figure out a way to live-debug running production thumbor processes - https://phabricator.wikimedia.org/T146143#2670107 (Gilles) I've packaged manhole: https://github.com/gi11es/thumbor-debian/tree/master/python-manhole  I couldn't get the tests to run, they're...
[10:31:22] <grrrit-wm>	 (CR) Ottomata: [C: 2 V: 2] Add gbp.conf [debs/python-snakebite] - https://gerrit.wikimedia.org/r/312985 (owner: Ottomata)
[10:33:53] <icinga-wm_>	 PROBLEM - MegaRAID on ms-be1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[10:34:09] <wikibugs>	 Operations, Performance-Team, Thumbor: Figure out a way to live-debug running production thumbor processes - https://phabricator.wikimedia.org/T146143#2670131 (Gilles) @fgiunchedi can you build it and put it on jessie-wikimedia?
[10:35:00] <icinga-wm_>	 PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdl1]
[10:40:53] <icinga-wm_>	 ACKNOWLEDGEMENT - MegaRAID on ms-be1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) Filippo Giunchedi sdl broken T146741
[10:40:53] <icinga-wm_>	 ACKNOWLEDGEMENT - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 15 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdl1] Filippo Giunchedi sdl broken T146741
[10:43:58] <grrrit-wm>	 (PS1) Ottomata: Install python snakebite (hdfs client) on analytics cluster clients [puppet] - https://gerrit.wikimedia.org/r/312987
[10:45:50] <grrrit-wm>	 (CR) Ottomata: [C: 2] Install python snakebite (hdfs client) on analytics cluster clients [puppet] - https://gerrit.wikimedia.org/r/312987 (owner: Ottomata)
[10:46:12] <ottomata>	 akosiaris: i'm puppet merging
[10:46:13] <ottomata>	 s'ok?
[10:46:21] <ottomata>	 conftool change?
[10:53:40] <grrrit-wm>	 (CR) Hashar: [C: 1] Move scap package version to class parameter [puppet] - https://gerrit.wikimedia.org/r/312971 (https://phabricator.wikimedia.org/T146618) (owner: 20after4)
[10:54:31] <grrrit-wm>	 (CR) Hashar: "Can you add the 'latest' override in hiera override in hieradata/labs/deployment-prep/common.yaml while at it ? :]" [puppet] - https://gerrit.wikimedia.org/r/312971 (https://phabricator.wikimedia.org/T146618) (owner: 20after4)
[10:56:46] <wikibugs>	 Operations: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#1936565 (hashar) Looks like the systemd for `udp2log-mw` needs some polishing. From T146723 it does not seem to always start properly.
[10:56:56] <wikibugs>	 Operations: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#2670187 (hashar)
[11:03:14] <wikibugs>	 Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2670227 (Pcoombe) >>! In T144952#2668547, @spatton wrote: > @awight our typical practice is to disab...
[12:13:22] <grrrit-wm1>	 (CR) Hashar: [C: 1] "Neat!" [puppet] - https://gerrit.wikimedia.org/r/312947 (owner: Thcipriani)
[12:20:52] <icinga-wm_>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 688 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3886622 keys - replication_delay is 688
[12:58:52] <icinga-wm_>	 PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:04:06] <icinga-wm_>	 ACKNOWLEDGEMENT - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 405 MB (0% inode=99%): Gehel Ops off site, this is a non critical server (no user traffic), investigation will come later - gehel
[13:24:08] <icinga-wm_>	 RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:46:47] <icinga-wm_>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:48:58] <icinga-wm_>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[15:38:29] <icinga-wm_>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection timed out
[15:52:42] <grrrit-wm1>	 (Draft1) Paladox: Fix copying text from comments in Internet Explorer [puppet] - https://gerrit.wikimedia.org/r/313029
[15:56:49] <grrrit-wm1>	 (PS2) Paladox: Gerrit: Fix copying text from comments in Internet Explorer [puppet] - https://gerrit.wikimedia.org/r/313029
[15:56:51] <icinga-wm_>	 PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: Connection refused
[15:57:57] <icinga-wm_>	 PROBLEM - cassandra service on maps-test2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[15:59:41] <_joe_>	 what's up with maps-test2001?
[15:59:46] <urandom>	 ^^^ it's an OOM
[15:59:54] <urandom>	 not sure why
[15:59:59] <urandom>	 is this machine in use?
[16:01:22] <urandom>	 anyway, i don't seem to have the karma needed to do anything with it
[16:02:08] <urandom>	 there is probably a heap dump in /var/lib/cassandra
[16:02:16] <urandom>	 nothing in logs that seems helpful
[16:02:23] <urandom>	 it can probably be started back up
[16:02:28] <urandom>	 (by someone who can...)
[16:03:46] <moritzm>	 maps-test2* was replaced by maps2*, these are only for testing now
[16:04:19] <urandom>	 that sounds familiar
[16:06:24] <grrrit-wm1>	 (PS1) Alex Monk: labs nfsclient: Require /mnt/nfs's existence before trying to mount underneath it [puppet] - https://gerrit.wikimedia.org/r/313034
[16:10:47] <grrrit-wm1>	 (PS1) Anomie: Use logstash's prune filter for api-feature-usage-sanitized [puppet] - https://gerrit.wikimedia.org/r/313035
[16:20:40] <icinga-wm_>	 RECOVERY - cassandra service on maps-test2001 is OK: OK - cassandra is active
[16:20:53] <grrrit-wm1>	 (CR) BryanDavis: "LGTM, should be cherry-picked to deployment-prep and tested there to ensure that things actually work as desired." [puppet] - https://gerrit.wikimedia.org/r/313035 (owner: Anomie)
[16:22:01] <icinga-wm_>	 RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 0.037 second response time on port 9042
[16:22:34] <grrrit-wm1>	 (PS1) EBernhardson: [cirrus] Rename CirrusSearchMoreLikeThisCluster [mediawiki-config] - https://gerrit.wikimedia.org/r/313037
[16:24:28] <icinga-wm_>	 PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:26:54] <grrrit-wm1>	 (PS2) Alex Monk: labs nfsclient: Require /mnt/nfs's existence before trying to mount underneath it [puppet] - https://gerrit.wikimedia.org/r/313034
[16:40:58] <NotASpy>	 can we have our own IRC network now ?
[16:51:49] <icinga-wm_>	 RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:40:18] <icinga-wm_>	 PROBLEM - puppet last run on db1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:05:08] <icinga-wm_>	 RECOVERY - puppet last run on db1035 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[18:27:45] <wikibugs>	 Operations, Icinga, Revision-Scoring-As-A-Service: halfak should get emails when ores.wikimedia.org goes down - https://phabricator.wikimedia.org/T146720#2671468 (Halfak) Eek.  It looks like I *am* getting emails and they were just not being filtered appropriately.  Thanks for looking @ArielGlenn
[18:27:57] <wikibugs>	 Operations, Icinga, Revision-Scoring-As-A-Service: Ensure that halfak gets emails when ores.wikimedia.org goes down - https://phabricator.wikimedia.org/T146720#2671470 (Halfak)
[18:28:12] <wikibugs>	 Operations, Icinga, Revision-Scoring-As-A-Service: Ensure that halfak gets emails when ores.wikimedia.org goes down - https://phabricator.wikimedia.org/T146720#2669336 (Halfak) Open>Resolved
[18:38:08] <icinga-wm_>	 PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:01:35] <icinga-wm_>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[19:04:11] <icinga-wm_>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[19:05:33] <icinga-wm_>	 RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:17:52] <icinga-wm_>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1700.95 seconds
[19:18:16] <wikibugs>	 Operations, Cassandra, RESTBase-Cassandra, Services: Highest SSTables / read thresholds - https://phabricator.wikimedia.org/T133091#2671659 (Eevans) One possibility for a short-term fix  might be something as simple as a script generated weekly report that is emailed to services@.   This script c...
[19:43:15] <icinga-wm_>	 ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3206.22 seconds Jcrespo normal, expected lag
[19:47:58] <grrrit-wm>	 (PS1) Paladox: Require graphoid::packages in the npm test [puppet] - https://gerrit.wikimedia.org/r/313058 (https://phabricator.wikimedia.org/T146783)
[19:48:02] <paladox>	 hashar ^^ would that work?
[19:48:59] <grrrit-wm>	 (PS2) Paladox: Require graphoid::packages in the npm test [puppet] - https://gerrit.wikimedia.org/r/313058 (https://phabricator.wikimedia.org/T146783)
[19:49:01] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Require graphoid::packages in the npm test [puppet] - https://gerrit.wikimedia.org/r/313058 (https://phabricator.wikimedia.org/T146783) (owner: Paladox)
[20:20:32] <grrrit-wm>	 (CR) Hashar: "We will probably want to use "deployment-tin" as the primary, that saves us from updating a lot of documentations and retrain deployers." [puppet] - https://gerrit.wikimedia.org/r/312654 (https://phabricator.wikimedia.org/T144006) (owner: Hashar)
[20:22:56] <grrrit-wm>	 (Abandoned) Hashar: Require graphoid::packages in the npm test [puppet] - https://gerrit.wikimedia.org/r/313058 (https://phabricator.wikimedia.org/T146783) (owner: Paladox)
[20:32:52] <icinga-wm_>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3807362 keys - replication_delay is 0
[21:06:02] <icinga-wm_>	 PROBLEM - cassandra service on maps-test2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[21:06:47] <icinga-wm_>	 PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: Connection refused
[21:12:27] <icinga-wm_>	 PROBLEM - puppet last run on logstash1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:20:58] <icinga-wm_>	 RECOVERY - cassandra service on maps-test2001 is OK: OK - cassandra is active
[21:21:42] <icinga-wm_>	 RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 0.041 second response time on port 9042
[21:37:31] <icinga-wm_>	 RECOVERY - puppet last run on logstash1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:30:07] <icinga-wm_>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 647 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3818051 keys - replication_delay is 647
[22:34:21] <wikibugs>	 Operations, Discovery, Kartographer, Maps, Traffic: Clarify caching to enable direct Wikidata Query Service access by <mapframe/link> - https://phabricator.wikimedia.org/T146832#2672412 (MaxSem)
[22:39:59] <grrrit-wm>	 (CR) Thcipriani: [C: 1] "want." [puppet] - https://gerrit.wikimedia.org/r/312971 (https://phabricator.wikimedia.org/T146618) (owner: 20after4)
[22:56:44] <wikibugs>	 Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2672426 (awight) So, @spatton and I just ran the experiment suggested above, where we add a broken b...
[22:58:02] <wikibugs>	 Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2672430 (awight) Tricking the HTTP cache with extra parameters like "foo=1" doesn't change the respo...
[23:00:03] <wikibugs>	 Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2672432 (awight) For some reason, this query is giving zero results.  I must be in a parallel univer...
[23:38:05] <wikibugs>	 Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2672556 (Peachey88) * Google Groups (Collaborative mailbox mode) ** https://support.google.com/a/answer/167430  * OTRS ** (+) Can have private comments/notes (no idea how it works in real life practise) ** (+) Already in...
[23:52:04] <wikibugs>	 Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2669983 (Krenair) >>! In T146746#2672556, @Peachey88 wrote: > * OTRS > ** (+) Can have private comments/notes (no idea how it works in real life practise)  Assuming it still works the same way it did when I was an agent...
[23:54:41] <wikibugs>	 Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2669983 (Platonides) Yes, it still works like that.