[00:00:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 39907 seconds ago, expected 28800
[00:04:05] PROBLEM - very high load average likely xfs on ms-be1002 is CRITICAL: CRITICAL - load average: 100.17, 100.14, 99.97
[00:05:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 40206 seconds ago, expected 28800
[00:10:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 40506 seconds ago, expected 28800
[00:15:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 40807 seconds ago, expected 28800
[00:20:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 41106 seconds ago, expected 28800
[00:22:14] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[00:25:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 41406 seconds ago, expected 28800
[00:29:08] jesus christ that barium alert
[00:29:21] it's a fundraising cluster host
[00:30:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 41706 seconds ago, expected 28800
[00:35:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 42006 seconds ago, expected 28800
[00:40:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 42307 seconds ago, expected 28800
[00:44:47] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:45:07] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 42606 seconds ago, expected 28800
[00:50:17] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 42906 seconds ago, expected 28800
[00:55:07] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 43206 seconds ago, expected 28800
[01:00:17] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 43506 seconds ago, expected 28800
[01:05:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 43806 seconds ago, expected 28800
[01:10:02] RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:10:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 44106 seconds ago, expected 28800
[01:12:52] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:15:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 44406 seconds ago, expected 28800
[01:20:11] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 44706 seconds ago, expected 28800
[01:23:01] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:25:13] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 45006 seconds ago, expected 28800
[01:30:16] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 133 seconds ago with 0 failures
[01:40:30] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:48:10] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:35:26] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[02:42:46] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[03:12:47] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:30:02] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3883515 keys - replication_delay is 0
[03:39:57] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:49:57] (CR) Legoktm: "T146619 was filed about .org no longer redirecting." [dns] - https://gerrit.wikimedia.org/r/244092 (owner: Dzahn)
[03:57:17] PROBLEM - Disk space on ms-be2009 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdd1 is not accessible: Input/output error
[03:58:36] PROBLEM - MegaRAID on ms-be2009 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[04:02:27] RECOVERY - Disk space on ms-be2009 is OK: DISK OK
[04:19:41] legoktm: https://gerrit.wikimedia.org/r/#/c/312945/
[04:19:55] hi AaronSchulz :)
[04:20:32] AaronSchulz: you should do that in Wikibase too ;)
[04:21:05] me hellos are of the form
[04:21:29] PREABLE= https://gerrit.wikimedia.org/r/#/c/; ID=\d+
[04:22:01] :P
[04:22:58] I CC;'d you on a class_exists() thread
[04:22:59] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdd1]
[04:23:10] could be useful for DatabaseBase
[04:25:29] AaronSchulz: why class_alias over class DatabaseBase extends Database {} ?
[04:25:40] (or the other way around)
[04:25:56] also, we're in the wrong channel ;)
[04:41:59] (CR) 20after4: [C: 1] Fix failing keyholder arming check [puppet] - https://gerrit.wikimedia.org/r/312947 (owner: Thcipriani)
[04:45:40] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:10:38] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[05:32:13] <_joe_> !log rebooting ms-be1002, stuck in a failed disk
[05:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:37:40] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100%
[05:40:44] ACKNOWLEDGEMENT - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Failing vd-11
[06:35:41] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tree]
[07:00:11] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:21:46] (PS1) 20after4: Move scap package version to class parameter [puppet] - https://gerrit.wikimedia.org/r/312971
[07:27:07] (PS2) 20after4: Move scap package version to class parameter [puppet] - https://gerrit.wikimedia.org/r/312971 (https://phabricator.wikimedia.org/T146618)
[07:47:54] Operations, MediaWiki-Stakeholders-Group, Traffic, Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2669818 (BBlack)
[07:47:57] Operations, Traffic: Remove "GeoIP lookup" service from https://status.wikimedia.org - https://phabricator.wikimedia.org/T146638#2669815 (BBlack) Open>Resolved a:BBlack Thanks for finding this, should've removed before service decom. It's gone now.
[07:59:14] Operations, Mail, LDAP, Patch-For-Review: Add yubikey attribute to production ldap - https://phabricator.wikimedia.org/T146102#2669822 (MoritzMuehlenhoff) Open>Resolved @bbogaert : Confirmed, the users awight, bboegart and bbunny have the attribute in the corp LDAP mirror.
[08:07:53] Operations, Performance-Team, Thumbor: Figure out a way to live-debug running production thumbor processes - https://phabricator.wikimedia.org/T146143#2669825 (Gilles)
[08:08:22] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[08:12:06] (PS6) ArielGlenn: More error logging/ sanity checks for dumpwikidata [puppet] - https://gerrit.wikimedia.org/r/311551 (owner: Hoo man)
[08:13:13] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[08:13:40] (CR) ArielGlenn: [C: 2] More error logging/ sanity checks for dumpwikidata [puppet] - https://gerrit.wikimedia.org/r/311551 (owner: Hoo man)
[08:18:13] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[08:22:28] (CR) Muehlenhoff: [C: 1] "Looks good to me. I can merge when I'm back from the offsite unless anyone beats me to it earlier." [puppet] - https://gerrit.wikimedia.org/r/312947 (owner: Thcipriani)
[08:23:13] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[08:25:38] (CR) Alexandros Kosiaris: Allow extra arguments to be passed to compile function (1 comment) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312516 (owner: Alexandros Kosiaris)
[08:26:13] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[08:27:04] RECOVERY - very high load average likely xfs on ms-be1002 is OK: OK - load average: 26.61, 7.68, 2.65
[08:27:50] Operations, Puppet, Goal, Puppet-infrastructure-modernization: Goal: Modernize puppet configuration management infrastructure - https://phabricator.wikimedia.org/T139471#2669840 (Joe)
[08:27:52] Operations, Puppet, Goal, Patch-For-Review, Puppet-infrastructure-modernization: Set up a puppet frontend in codfw who can work as a slave of eqiad's master - https://phabricator.wikimedia.org/T143869#2669839 (Joe) Open>Resolved
[08:27:54] RECOVERY - MegaRAID on ms-be1002 is OK: OK: optimal, 14 logical, 14 physical
[08:28:45] (PS4) Alexandros Kosiaris: Allow extra arguments to be passed to compile function [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312516
[08:29:17] (CR) jenkins-bot: [V: -1] Allow extra arguments to be passed to compile function [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312516 (owner: Alexandros Kosiaris)
[08:32:41] (PS5) Alexandros Kosiaris: Allow extra arguments to be passed to compile function [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312516
[08:38:33] Operations, ops-eqiad: ms-be1002.eqiad.wmnet: slot=11 dev=sdl failed - https://phabricator.wikimedia.org/T146741#2669846 (fgiunchedi) NEW
[08:38:40] Operations, Icinga, Revision-Scoring-As-A-Service: halfak should get emails when ores.wikimedia.org goes down - https://phabricator.wikimedia.org/T146720#2669336 (ArielGlenn) modules/icinga/manifests/monitor/ores.pp already lists team-ores as the contact group for the service checks, just like ores_l...
[08:38:49] Operations, Icinga, Revision-Scoring-As-A-Service: halfak should get emails when ores.wikimedia.org goes down - https://phabricator.wikimedia.org/T146720#2669854 (ArielGlenn) p:Triage>Normal
[08:39:05] Operations, Patch-For-Review: Re-add intel-microcode - https://phabricator.wikimedia.org/T127825#2669855 (BBlack) +1 on tying this to 4.4+ and it being a good idea to get it going again. I'm not sure (unless someone's investigated already) that our latest 3.1[69] and such kernels don't still have the in...
[08:39:25] PROBLEM - puppet last run on mw1236 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:47:41] Operations, ops-eqiad: ms-be1002.eqiad.wmnet: slot=11 dev=sdl failed - https://phabricator.wikimedia.org/T146741#2669896 (fgiunchedi) puppet chokes while running xfs_admin, and gets stuck in status D. Also smartctl can read basic info but not return attributes, it looks like to me the disk is dead. I've...
[08:50:23] (PS6) Alexandros Kosiaris: Enable future parser based on .future file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312517
[08:50:57] (CR) jenkins-bot: [V: -1] Enable future parser based on .future file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312517 (owner: Alexandros Kosiaris)
[08:56:16] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:05:24] PROBLEM - Host lutetium is DOWN: CRITICAL - Network Unreachable (208.80.155.13)
[09:05:34] PROBLEM - Host frqueue1002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:06:01] Jeff_Green: ?
[09:06:31] looking...
[09:06:39] Operations, Wikimedia-Mailing-lists: deactivate maint-announce - https://phabricator.wikimedia.org/T143760#2669921 (Dzahn) As discussed during ops offsite, we'll keep using RT for now and also investigate the possibilty of a shared inbox, but won't go back to the mailing list.
[09:06:49] Operations, Wikimedia-Mailing-lists: deactivate maint-announce - https://phabricator.wikimedia.org/T143760#2669922 (Dzahn) Open>Resolved
[09:06:54] PROBLEM - Host pfw-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.218)
[09:07:16] RECOVERY - Host frqueue1002 is UP: PING WARNING - Packet loss = 58%, RTA = 1.54 ms
[09:07:25] RECOVERY - Host lutetium is UP: PING OK - Packet loss = 0%, RTA = 8.23 ms
[09:07:29] ahhh, firewall.. that explains it
[09:07:34] RECOVERY - Host pfw-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[09:07:34] RECOVERY - puppet last run on mw1236 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:09:15] (PS4) MarcoAurelio: Initial configuration for olo.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612)
[09:10:40] (CR) MarcoAurelio: "I'd suggest to upload the logo in a separate patch, so it can be optiPNG'd. I've set the logo path here though." [mediawiki-config] - https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[09:16:46] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[09:19:26] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[09:29:15] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[09:32:11] Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2669983 (Dzahn)
[09:32:45] (CR) Alexandros Kosiaris: Enable future parser based on .future file (2 comments) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312517 (owner: Alexandros Kosiaris)
[09:35:59] (PS7) Alexandros Kosiaris: Enable future parser based on .future file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312517
[09:36:34] (CR) jenkins-bot: [V: -1] Enable future parser based on .future file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312517 (owner: Alexandros Kosiaris)
[09:39:05] (PS8) Alexandros Kosiaris: Enable future parser based on .future file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312517
[09:40:22] (CR) Alexandros Kosiaris: [C: 2] Allow extra arguments to be passed to compile function [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312516 (owner: Alexandros Kosiaris)
[09:40:55] (CR) Alexandros Kosiaris: [C: 2] Enable future parser based on .future file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/312517 (owner: Alexandros Kosiaris)
[09:41:56] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[09:46:57] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[09:51:26] (PS1) Urbanecm: Upload 1x logo for olowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/312977 (https://phabricator.wikimedia.org/T146745)
[09:52:25] (PS1) Alexandros Kosiaris: puppet_compiler: Fix breakage caused by I76610eb72 [puppet] - https://gerrit.wikimedia.org/r/312978
[09:54:10] (CR) jenkins-bot: [V: -1] puppet_compiler: Fix breakage caused by I76610eb72 [puppet] - https://gerrit.wikimedia.org/r/312978 (owner: Alexandros Kosiaris)
[09:56:01] (PS2) Alexandros Kosiaris: puppet_compiler: Fix breakage caused by I76610eb72 [puppet] - https://gerrit.wikimedia.org/r/312978
[09:57:54] (CR) Alexandros Kosiaris: [C: 2] puppet_compiler: Fix breakage caused by I76610eb72 [puppet] - https://gerrit.wikimedia.org/r/312978 (owner: Alexandros Kosiaris)
[10:02:19] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 34 probes of 422 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[10:08:42] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 422 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[10:09:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[10:17:09] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[10:29:20] (PS1) Alexandros Kosiaris: Revert "Stabilize the output of stdlib's keys function" [puppet] - https://gerrit.wikimedia.org/r/312984
[10:30:54] (PS1) Ottomata: Add gbp.conf [debs/python-snakebite] - https://gerrit.wikimedia.org/r/312985
[10:31:18] Operations, Performance-Team, Thumbor: Figure out a way to live-debug running production thumbor processes - https://phabricator.wikimedia.org/T146143#2670107 (Gilles) I've packaged manhole: https://github.com/gi11es/thumbor-debian/tree/master/python-manhole I couldn't get the tests to run, they're...
[10:31:22] (CR) Ottomata: [C: 2 V: 2] Add gbp.conf [debs/python-snakebite] - https://gerrit.wikimedia.org/r/312985 (owner: Ottomata)
[10:33:53] PROBLEM - MegaRAID on ms-be1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[10:34:09] Operations, Performance-Team, Thumbor: Figure out a way to live-debug running production thumbor processes - https://phabricator.wikimedia.org/T146143#2670131 (Gilles) @fgiunchedi can you build it and put it on jessie-wikimedia?
[10:35:00] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdl1]
[10:40:53] ACKNOWLEDGEMENT - MegaRAID on ms-be1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) Filippo Giunchedi sdl broken T146741
[10:40:53] ACKNOWLEDGEMENT - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 15 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdl1] Filippo Giunchedi sdl broken T146741
[10:43:58] (PS1) Ottomata: Install python snakebite (hdfs client) on analytics cluster clients [puppet] - https://gerrit.wikimedia.org/r/312987
[10:45:50] (CR) Ottomata: [C: 2] Install python snakebite (hdfs client) on analytics cluster clients [puppet] - https://gerrit.wikimedia.org/r/312987 (owner: Ottomata)
[10:46:12] akosiaris: i'm puppet merging
[10:46:13] s'ok?
[10:46:21] conftool change?
[10:53:40] (CR) Hashar: [C: 1] Move scap package version to class parameter [puppet] - https://gerrit.wikimedia.org/r/312971 (https://phabricator.wikimedia.org/T146618) (owner: 20after4)
[10:54:31] (CR) Hashar: "Can you add the 'latest' override in hiera override in hieradata/labs/deployment-prep/common.yaml while at it ? :]" [puppet] - https://gerrit.wikimedia.org/r/312971 (https://phabricator.wikimedia.org/T146618) (owner: 20after4)
[10:56:46] Operations: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#1936565 (hashar) Looks like the systemd for `udp2log-mw` needs some polishing. From T146723 it does not seem to always start properly.
[10:56:56] Operations: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#2670187 (hashar)
[11:03:14] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2670227 (Pcoombe) >>! In T144952#2668547, @spatton wrote: > @awight our typical practice is to disab...
[12:13:22] (CR) Hashar: [C: 1] "Neat!" [puppet] - https://gerrit.wikimedia.org/r/312947 (owner: Thcipriani)
[12:20:52] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 688 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3886622 keys - replication_delay is 688
[12:58:52] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:04:06] ACKNOWLEDGEMENT - Disk space on maps-test2001 is CRITICAL: DISK CRITICAL - free space: /srv 405 MB (0% inode=99%): Gehel Ops off site, this is a non critical server (no user traffic), investigation will come later - gehel
[13:24:08] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:46:47] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:48:58] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[15:38:29] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection timed out
[15:52:42] (Draft1) Paladox: Fix copying text from comments in Internet Explorer [puppet] - https://gerrit.wikimedia.org/r/313029
[15:56:49] (PS2) Paladox: Gerrit: Fix copying text from comments in Internet Explorer [puppet] - https://gerrit.wikimedia.org/r/313029
[15:56:51] PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: Connection refused
[15:57:57] PROBLEM - cassandra service on maps-test2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[15:59:41] <_joe_> what's up with maps-test2001?
[15:59:46] ^^^ it's an OOM
[15:59:54] not sure why
[15:59:59] is this machine in use?
[16:01:22] anyway, i don't seem to have the karma needed to do anything with it
[16:02:08] there is probably a heap dump in /var/lib/cassandra
[16:02:16] nothing in logs that seems helpful
[16:02:23] it can probably be started back up
[16:02:28] (by someone who can...)
[16:03:46] maps-test2* was replaced by maps2*, these are only for testing now
[16:04:19] that sounds familiar
[16:06:24] (PS1) Alex Monk: labs nfsclient: Require /mnt/nfs's existence before trying to mount underneath it [puppet] - https://gerrit.wikimedia.org/r/313034
[16:10:47] (PS1) Anomie: Use logstash's prune filter for api-feature-usage-sanitized [puppet] - https://gerrit.wikimedia.org/r/313035
[16:20:40] RECOVERY - cassandra service on maps-test2001 is OK: OK - cassandra is active
[16:20:53] (CR) BryanDavis: "LGTM, should be cherry-picked to deployment-prep and tested there to ensure that things actually work as desired." [puppet] - https://gerrit.wikimedia.org/r/313035 (owner: Anomie)
[16:22:01] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 0.037 second response time on port 9042
[16:22:34] (PS1) EBernhardson: [cirrus] Rename CirrusSearchMoreLikeThisCluster [mediawiki-config] - https://gerrit.wikimedia.org/r/313037
[16:24:28] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:26:54] (PS2) Alex Monk: labs nfsclient: Require /mnt/nfs's existence before trying to mount underneath it [puppet] - https://gerrit.wikimedia.org/r/313034
[16:40:58] can we have our own IRC network now ?
[16:51:49] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:40:18] PROBLEM - puppet last run on db1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:05:08] RECOVERY - puppet last run on db1035 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[18:27:45] Operations, Icinga, Revision-Scoring-As-A-Service: halfak should get emails when ores.wikimedia.org goes down - https://phabricator.wikimedia.org/T146720#2671468 (Halfak) Eek. It looks like I *am* getting emails and they were just not being filtered appropriately. Thanks for looking @ArielGlenn
[18:27:57] Operations, Icinga, Revision-Scoring-As-A-Service: Ensure that halfak gets emails when ores.wikimedia.org goes down - https://phabricator.wikimedia.org/T146720#2671470 (Halfak)
[18:28:12] Operations, Icinga, Revision-Scoring-As-A-Service: Ensure that halfak gets emails when ores.wikimedia.org goes down - https://phabricator.wikimedia.org/T146720#2669336 (Halfak) Open>Resolved
[18:38:08] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:01:35] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[19:04:11] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[19:05:33] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:17:52] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1700.95 seconds
[19:18:16] Operations, Cassandra, RESTBase-Cassandra, Services: Highest SSTables / read thresholds - https://phabricator.wikimedia.org/T133091#2671659 (Eevans) One possibility for a short-term fix might be something as simple as a script generated weekly report that is emailed to services@. This script c...
[19:43:15] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3206.22 seconds Jcrespo normal, expected lag
[19:47:58] (PS1) Paladox: Require graphoid::packages in the npm test [puppet] - https://gerrit.wikimedia.org/r/313058 (https://phabricator.wikimedia.org/T146783)
[19:48:02] hashar ^^ would that work?
[19:48:59] (PS2) Paladox: Require graphoid::packages in the npm test [puppet] - https://gerrit.wikimedia.org/r/313058 (https://phabricator.wikimedia.org/T146783)
[19:49:01] (CR) jenkins-bot: [V: -1] Require graphoid::packages in the npm test [puppet] - https://gerrit.wikimedia.org/r/313058 (https://phabricator.wikimedia.org/T146783) (owner: Paladox)
[20:20:32] (CR) Hashar: "We will probably want to use "deployment-tin" as the primary, that saves us from updating a lot of documentations and retrain deployers." [puppet] - https://gerrit.wikimedia.org/r/312654 (https://phabricator.wikimedia.org/T144006) (owner: Hashar)
[20:22:56] (Abandoned) Hashar: Require graphoid::packages in the npm test [puppet] - https://gerrit.wikimedia.org/r/313058 (https://phabricator.wikimedia.org/T146783) (owner: Paladox)
[20:32:52] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3807362 keys - replication_delay is 0
[21:06:02] PROBLEM - cassandra service on maps-test2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[21:06:47] PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: Connection refused
[21:12:27] PROBLEM - puppet last run on logstash1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:20:58] RECOVERY - cassandra service on maps-test2001 is OK: OK - cassandra is active
[21:21:42] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 0.041 second response time on port 9042
[21:37:31] RECOVERY - puppet last run on logstash1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:30:07] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 647 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3818051 keys - replication_delay is 647
[22:34:21] Operations, Discovery, Kartographer, Maps, Traffic: Clarify caching to enable direct Wikidata Query Service access by - https://phabricator.wikimedia.org/T146832#2672412 (MaxSem)
[22:39:59] (CR) Thcipriani: [C: 1] "want." [puppet] - https://gerrit.wikimedia.org/r/312971 (https://phabricator.wikimedia.org/T146618) (owner: 20after4)
[22:56:44] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2672426 (awight) So, @spatton and I just ran the experiment suggested above, where we add a broken b...
[22:58:02] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2672430 (awight) Tricking the HTTP cache with extra parameters like "foo=1" doesn't change the respo...
[23:00:03] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2672432 (awight) For some reason, this query is giving zero results. I must be in a parallel univer...
[23:38:05] Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2672556 (Peachey88) * Google Groups (Collaborative mailbox mode) ** https://support.google.com/a/answer/167430 * OTRS ** (+) Can have private comments/notes (no idea how it works in real life practise) ** (+) Already in...
[23:52:04] Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2669983 (Krenair) >>! In T146746#2672556, @Peachey88 wrote: > * OTRS > ** (+) Can have private comments/notes (no idea how it works in real life practise) Assuming it still works the same way it did when I was an agent...
[23:54:41] Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2669983 (Platonides) Yes, it still works like that.