[00:48:24] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%) [02:23:24] PROBLEM - puppet last run on ms-fe1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:51:25] RECOVERY - puppet last run on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [02:55:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [02:55:59] I'll probably deploy ores soon-ish [02:56:15] unscheduled obviously: https://phabricator.wikimedia.org/T154168 [02:56:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [02:56:39] Just phab is not fast enough to catch up with gerrit [02:58:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:01:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [03:01:44] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [03:03:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:03:34] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:12:52] (03PS1) 10Ladsgroup: wikilabels: install nodejs package [puppet] - 10https://gerrit.wikimedia.org/r/329316 (https://phabricator.wikimedia.org/T154122) [03:24:24] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 722.70 seconds [03:29:24] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 173.69 seconds [03:29:44] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [04:09:04] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=307.10 Read Requests/Sec=275.80 Write Requests/Sec=0.50 KBytes Read/Sec=35266.40 KBytes_Written/Sec=4.40 [04:12:28] 06Operations: Two-factor authorisation reset request from user:Angelo.romano - https://phabricator.wikimedia.org/T154171#2903056 (10Peachey88) [04:17:04] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=13.40 Read Requests/Sec=0.50 Write Requests/Sec=8.20 KBytes Read/Sec=2.80 KBytes_Written/Sec=42.00 [04:26:24] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:54:24] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [05:06:06] !log starting deploy of ores:228b9b4 in canary nodes (T154168) [05:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:10] T154168: Quantity changes broke ORES - https://phabricator.wikimedia.org/T154168 [05:06:18] !log ladsgroup@tin Starting deploy [ores/deploy@228b9b4]: (no message) [05:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:37] !log starting deploy of ores:228b9b4 in all nodes (T154168) [05:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:40] T154168: Quantity changes broke ORES - https://phabricator.wikimedia.org/T154168 [05:25:05] !log ladsgroup@tin Finished deploy [ores/deploy@228b9b4]: (no message) (duration: 18m 46s) [05:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:27] !log finished deploy of ores:228b9b4 in all nodes (T154168) [05:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:30] T154168: Quantity changes broke ORES - https://phabricator.wikimedia.org/T154168 [05:44:10] !log ladsgroup@terbium:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=wikidatawiki (T154168) [05:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:13] T154168: Quantity changes broke ORES - https://phabricator.wikimedia.org/T154168 [05:50:01] 06Operations: Two-factor authorisation reset request from user:Angelo.romano - https://phabricator.wikimedia.org/T154171#2903017 (10Parent5446) Possible things to do: - [ ] Ensure the provided email matches the email of the account. Then send an email confirming the request. - [ ] Challenge him to provide his c... [05:55:16] 06Operations: Two-factor authorisation reset request from user:Angelo.romano - https://phabricator.wikimedia.org/T154171#2903136 (10Jalexander) a:03Jalexander I can look into this Tuesday if no one else is able to, once verified we can remove 2fa. Will send an email to the wiki users address now to start a cou... [05:58:42] 06Operations: Two-factor authorisation reset request from user:Angelo.romano - https://phabricator.wikimedia.org/T154171#2903139 (10Parent5446) @Jalexander is there a specific tag for these requests / can we make one? It would be useful for tracking to figure out how high priority T131789 and related tasks shoul... [06:25:24] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:25:34] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [06:31:44] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [06:53:34] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:55:24] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:35:51] (03PS1) 10ArielGlenn: media title dumps: use explicit path to list of wikis with globaluseagelist [puppet] - 10https://gerrit.wikimedia.org/r/329323 [07:41:51] (03CR) 10ArielGlenn: [C: 032] media title dumps: use explicit path to list of wikis with globaluseagelist [puppet] - 10https://gerrit.wikimedia.org/r/329323 (owner: 10ArielGlenn) [08:23:23] 06Operations, 10DBA, 07Chinese-Sites: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#2903280 (10Shizhao) [08:51:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [08:52:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [08:54:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:59:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [09:02:42] <_joe_> uhm [09:02:46] <_joe_> checking [09:03:33] <_joe_> a transient problem, it seems [09:03:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:04:00] <_joe_> it was 500s, so MediaWiki errors [09:07:34] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:09:04] (03PS1) 10Alexandros Kosiaris: k8s::apiserver: Leave authn out of authz if clause [puppet] - 10https://gerrit.wikimedia.org/r/329326 [10:15:56] (03CR) 10Alexandros Kosiaris: [C: 032] k8s::apiserver: Leave authn out of authz if clause [puppet] - 10https://gerrit.wikimedia.org/r/329326 (owner: 10Alexandros Kosiaris) [10:52:14] PROBLEM - Disk space on kubernetes1004 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/ab50d9422116c55331ef766093548bd565d407b679e13ec8276463b9540d470a/shm is not accessible: Permission denied [10:52:14] PROBLEM - Disk space on kubernetes1002 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/386db64ef8c000727e583d75837ff32fa95468bfb681230bb6fe0b630ae0dd79/shm is not accessible: Permission denied [10:52:50] <_joe_> ? [10:53:17] <_joe_> akosiaris: we might want to change check_disk on those machines [10:54:27] _joe_: just running my very first pods [10:54:38] <_joe_> yeah I figured [10:54:38] finally, after some PEBKACs last week [10:54:44] <_joe_> \o/ [10:56:26] funny how mount shows /var/lib/docker/devicemapper/mnt/9f80c2485e582ff1e551f374a62a842eca33e76e24bc2fc5b23a2019baa993e1 but df doesn't [10:56:30] xfs ? [10:56:34] PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:56:50] why on earth are we using XFS there ? 
not that it matters much right now [10:57:09] but when we go to production I'd like to not have machines lock up due to xfs issues [11:18:02] <_joe_> akosiaris: ask docker [11:18:25] <_joe_> we did just create the lvm volumes [11:20:58] yeah docker info says XFS [11:21:12] anyway, will figure out at some other point. making a note for now [11:24:34] RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [12:04:31] (03PS1) 10Tim Landscheidt: postgresql: Only set user password if different [puppet] - 10https://gerrit.wikimedia.org/r/329328 [12:09:33] (03CR) 10Tim Landscheidt: "Tested with a) user not existing (which causes pass_set to be triggered by create_user) and b) password changed ("ALTER ROLE replication P" [puppet] - 10https://gerrit.wikimedia.org/r/329328 (owner: 10Tim Landscheidt) [12:23:44] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:30:57] (03PS1) 10Tim Landscheidt: puppetdb: Do not set up Ganglia in Labs [puppet] - 10https://gerrit.wikimedia.org/r/329329 (https://phabricator.wikimedia.org/T154104) [12:31:17] 07Puppet, 06Labs, 10Tool-Labs, 13Patch-For-Review: role::puppetmaster::puppetdb depends on Ganglia and cannot be used in Labs - https://phabricator.wikimedia.org/T154104#2903464 (10scfc) a:03scfc [12:48:41] (03PS1) 10Tim Landscheidt: puppetdb: Do not hardcode puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/329330 (https://phabricator.wikimedia.org/T153577) [12:51:39] (03CR) 10Tim Landscheidt: "Tested the inline_template() call and it is 1:1 the existing ferm rule for the production Hiera data; yet, of course, this requires a bit " [puppet] - 10https://gerrit.wikimedia.org/r/329330 (https://phabricator.wikimedia.org/T153577) (owner: 10Tim Landscheidt) [12:51:44] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:54:41] 07Puppet, 06Labs, 10Tool-Labs, 13Patch-For-Review: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#2903500 (10scfc) a:03scfc [13:24:34] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [13:27:54] PROBLEM - Check systemd state on kubernetes1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:28:54] RECOVERY - Check systemd state on kubernetes1001 is OK: OK - running: The system is fully operational [13:35:16] <_joe_> uhm [13:35:26] <_joe_> what happened there akosiaris ?
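The kubernetes100x disk alerts earlier in the hour ("DISK CRITICAL - /var/lib/docker/containers/.../shm is not accessible: Permission denied") come from check_disk stumbling over docker's per-container mounts rather than from a genuinely full filesystem, which is why _joe_ suggests changing check_disk on those machines; patches doing exactly that ("role::kubernetes::worker: tweak disk checks" and "Add /run/docker/netns/ as well in ignored disk checks") land later in this log. As a rough illustration only, a check_disk invocation that skips those mounts could look like the sketch below; the thresholds and exact ignore patterns are assumptions, not the command production actually runs:

    # Minimal sketch, not the real Icinga/NRPE command: skip docker-managed mount
    # points, whose per-container shm and netns entries the nagios user cannot stat.
    /usr/lib/nagios/plugins/check_disk -w 6% -c 3% -l \
      --ignore-ereg-path='^/var/lib/docker/' \
      --ignore-ereg-path='^/run/docker/netns/'

--ignore-ereg-path can be repeated, so further docker paths (such as the netns one added later in the log) can be appended without restructuring the check.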
[13:45:05] (03PS1) 10Alexandros Kosiaris: kubernetes: Instruct docker to not handle iptables [puppet] - 10https://gerrit.wikimedia.org/r/329333 [13:45:07] (03PS1) 10Alexandros Kosiaris: Force docker bridge IP address per host [puppet] - 10https://gerrit.wikimedia.org/r/329334 [13:49:51] (03CR) 10Giuseppe Lavagetto: Force docker bridge IP address per host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/329334 (owner: 10Alexandros Kosiaris) [13:50:49] (03PS2) 10Alexandros Kosiaris: Force docker bridge IP address per host [puppet] - 10https://gerrit.wikimedia.org/r/329334 [13:52:34] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [13:54:09] (03PS1) 10Alexandros Kosiaris: kubernetes::apiserver: Fix admission_control if clause [puppet] - 10https://gerrit.wikimedia.org/r/329337 [13:55:48] (03CR) 10Alexandros Kosiaris: "Addressed Giuseppe comment (that answer was "by mistake"). PCC ran successfully and with expected output at https://puppet-compiler.wmflab" [puppet] - 10https://gerrit.wikimedia.org/r/329334 (owner: 10Alexandros Kosiaris) [13:56:29] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes: Instruct docker to not handle iptables [puppet] - 10https://gerrit.wikimedia.org/r/329333 (owner: 10Alexandros Kosiaris) [13:56:47] (03CR) 10Alexandros Kosiaris: [C: 032] Force docker bridge IP address per host [puppet] - 10https://gerrit.wikimedia.org/r/329334 (owner: 10Alexandros Kosiaris) [13:57:17] _joe_: me fighting with docker. Part of the problems fixed in patches above [13:57:44] <_joe_> cool [13:58:27] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes::apiserver: Fix admission_control if clause [puppet] - 10https://gerrit.wikimedia.org/r/329337 (owner: 10Alexandros Kosiaris) [14:01:24] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK [14:01:24] RECOVERY - Disk space on kubernetes1002 is OK: DISK OK [14:01:34] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 seconds ago with 1 failures. Failed resources (up to 3 shown): Service[docker] [14:02:34] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [14:07:24] PROBLEM - Disk space on kubernetes1004 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/a66cf08a28cba21908040c2259587c0aa39fe25a2e5fff452b5d7face8c9f33e/shm is not accessible: Permission denied [14:09:54] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:12:24] PROBLEM - Disk space on kubernetes1003 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/10f7cd0ee5b65452e14ef7799cdbb1243f38c5ddff530f0262b90150d29dea28/shm is not accessible: Permission denied [14:22:26] (03PS1) 10Giuseppe Lavagetto: openstack::horizon::service: use require_package for keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/329339 [14:37:54] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:41:54] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:56:08] (03CR) 10Giuseppe Lavagetto: [C: 032] openstack::horizon::service: use require_package for keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/329339 (owner: 10Giuseppe Lavagetto) [15:09:24] PROBLEM - Host barium is DOWN: PING CRITICAL - Packet loss = 100% [15:10:17] <_joe_> I guess this is Jeff n Chris? [15:10:54] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:11:14] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 105, down: 1, dormant: 0, excluded: 2, unused: 0BRge-2/0/7: down - bariumBR [15:12:14] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 107, down: 0, dormant: 0, excluded: 2, unused: 0 [15:12:47] ACKNOWLEDGEMENT - Host barium is DOWN: PING CRITICAL - Packet loss = 100% Jeff_Green hard disk replacement [15:13:06] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:19:24] RECOVERY - Disk space on kubernetes1003 is OK: DISK OK [15:21:24] PROBLEM - Disk space on kubernetes1002 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/f1fcdc1e23d40f3ccad307fd08a9c6420018ec2432a46f6afd7c96a4b984870e/shm is not accessible: Permission denied [15:24:24] PROBLEM - Disk space on kubernetes1003 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/6ba7762c0da40811ad17511e241197114c34a63ebce3905c71673f5581d4cb1c/shm is not accessible: Permission denied [15:37:14] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 105, down: 1, dormant: 0, excluded: 2, unused: 0BRge-2/0/7: down - bariumBR [15:39:14] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 107, down: 0, dormant: 0, excluded: 2, unused: 0 [15:43:24] RECOVERY - Disk space on kubernetes1002 is OK: DISK OK [15:55:01] (03Abandoned) 10Giuseppe Lavagetto: New release [debs/pybal] - 10https://gerrit.wikimedia.org/r/268629 (owner: 10Giuseppe Lavagetto) [15:58:14] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 105, down: 1, dormant: 0, excluded: 2, unused: 0BRge-2/0/7: down - bariumBR [15:59:44] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[docker-engine] [15:59:44] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[docker-engine] [15:59:44] PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[docker-engine] [15:59:45] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 13 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[docker-engine] [16:00:05] (03PS1) 10Alexandros Kosiaris: kubernetes::worker: Bump docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329344 [16:00:24] PROBLEM - Disk space on kubernetes1002 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/c33e528ee164b21156a0d315f523f84a0bfe8cb45217d7572de00c407a2d6533/shm is not accessible: Permission denied [16:01:14] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 107, down: 0, dormant: 0, excluded: 2, unused: 0 [16:04:08] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes::worker: Bump docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329344 (owner: 10Alexandros Kosiaris) [16:04:15] (03PS2) 10Alexandros Kosiaris: kubernetes::worker: Bump docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329344 [16:04:18] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes::worker: Bump docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329344 (owner: 10Alexandros Kosiaris) [16:07:59] (03PS1) 10Giuseppe Lavagetto: role::kubernetes::worker: tweak disk checks [puppet] - 10https://gerrit.wikimedia.org/r/329345 [16:08:03] <_joe_> akosiaris: ^^ [16:11:51] (03PS1) 10Alexandros Kosiaris: kubernetes: Specify correctly the docker version [puppet] - 10https://gerrit.wikimedia.org/r/329346 [16:12:03] (03CR) 10Alexandros Kosiaris: [C: 032] role::kubernetes::worker: tweak disk checks [puppet] - 10https://gerrit.wikimedia.org/r/329345 (owner: 10Giuseppe Lavagetto) [16:12:08] (03PS2) 10Alexandros Kosiaris: role::kubernetes::worker: tweak disk checks [puppet] - 10https://gerrit.wikimedia.org/r/329345 (owner: 10Giuseppe Lavagetto) [16:12:13] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] role::kubernetes::worker: tweak disk checks [puppet] - 10https://gerrit.wikimedia.org/r/329345 (owner: 10Giuseppe Lavagetto) [16:12:36] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes: Specify correctly the docker version [puppet] - 10https://gerrit.wikimedia.org/r/329346 (owner: 10Alexandros Kosiaris) [16:12:44] (03PS2) 10Alexandros Kosiaris: kubernetes: Specify correctly the docker version [puppet] - 10https://gerrit.wikimedia.org/r/329346 [16:12:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes: Specify correctly the docker version [puppet] - 10https://gerrit.wikimedia.org/r/329346 (owner: 10Alexandros Kosiaris) [16:14:44] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:14:44] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:14:44] RECOVERY - puppet last run on kubernetes1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:14:54] RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:17:05] (03PS1) 10Alexandros Kosiaris: kubernetes: Add /run/docker/netns/ as well in ignored disk checks [puppet] - 10https://gerrit.wikimedia.org/r/329347 [16:23:50] <_joe_> !log power down prometheus2003 [16:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:24] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
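The Package[docker-engine] puppet failures on kubernetes100[1-4] above, together with the follow-up patches "Bump docker-engine version" and "Specify correctly the docker version", look like the usual symptom of a Package resource pinned to an exact version that the configured apt repository does not offer: apt cannot find that version, so the resource fails until the pin is corrected. A hedged sketch of how one would confirm and fix that by hand; the version string here is hypothetical, not the one the puppet change actually pins:

    # Sketch only; "1.12.5-0~debian-jessie" is an invented example version.
    apt-cache policy docker-engine                              # show which versions the repos actually offer
    apt-get install -y docker-engine=1.12.5-0~debian-jessie     # install one exact, pinned version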
[16:31:56] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes: Add /run/docker/netns/ as well in ignored disk checks [puppet] - 10https://gerrit.wikimedia.org/r/329347 (owner: 10Alexandros Kosiaris) [16:32:14] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [16:32:40] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2903669 (10zhuyifei1999) [16:35:27] 06Operations, 10ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2903671 (10Joe) My suggestion would be that these 10 new systems should replace mw2075 - mw2090 functionally, and specifically: - **3 servers** to replace the 5 API appservers mw2075-79 so in **row A** - **... [16:35:43] 06Operations: reinstall/reimage sinistra as mwlog2001 - https://phabricator.wikimedia.org/T153384#2903674 (10Papaul) [16:35:46] 06Operations, 10ops-codfw: update label/racktables visible label for mwlog2001 (was sinistra) - https://phabricator.wikimedia.org/T153771#2903672 (10Papaul) 05Open>03Resolved Complete [16:35:50] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2903676 (10zhuyifei1999) [[https://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&c=Video+scalers+eqiad&h=&tab=m&v... [16:36:09] 06Operations, 10ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2903678 (10Joe) a:05Joe>03RobH [16:36:24] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK [16:39:24] RECOVERY - Disk space on kubernetes1003 is OK: DISK OK [16:42:24] RECOVERY - Disk space on kubernetes1002 is OK: DISK OK [16:44:28] 06Operations, 10ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2857513 (10Papaul) @joe are those systems already decommissioned? [16:45:14] RECOVERY - Host barium is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [16:46:04] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:59:41] (03PS1) 10Alexandros Kosiaris: Production kubernetes: Specify the service IP range [puppet] - 10https://gerrit.wikimedia.org/r/329350 [17:06:53] (03PS1) 10Alexandros Kosiaris: kubernetes apiserver: Allow specifying > 1 apiserver [puppet] - 10https://gerrit.wikimedia.org/r/329351 [17:10:51] (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/4999/ says diff is fine, merging into tool labs first to ensure no breakage" [puppet] - 10https://gerrit.wikimedia.org/r/329351 (owner: 10Alexandros Kosiaris) [17:14:19] (03CR) 10Alexandros Kosiaris: "cherry picked in tool labs, was a noop, merging" [puppet] - 10https://gerrit.wikimedia.org/r/329351 (owner: 10Alexandros Kosiaris) [17:14:28] (03CR) 10Alexandros Kosiaris: [C: 032] Production kubernetes: Specify the service IP range [puppet] - 10https://gerrit.wikimedia.org/r/329350 (owner: 10Alexandros Kosiaris) [17:14:31] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes apiserver: Allow specifying > 1 apiserver [puppet] - 10https://gerrit.wikimedia.org/r/329351 (owner: 10Alexandros Kosiaris) [17:15:04] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:28:47] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2903705 (10Papaul) 05Open>03Resolved IDRAC card extention replacement complete. The system is back up and using the the dedicated card. [17:30:44] PROBLEM - Host db2035 is DOWN: PING CRITICAL - Packet loss = 100% [17:32:14] RECOVERY - Host db2035 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [17:35:09] PROBLEM - mysqld processes on db2035 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [17:35:09] PROBLEM - MariaDB Slave SQL: s2 on db2035 is CRITICAL: CRITICAL slave_sql_state could not connect [17:35:09] PROBLEM - MariaDB Slave IO: s2 on db2035 is CRITICAL: CRITICAL slave_io_state could not connect [17:41:54] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:47:08] RECOVERY - mysqld processes on db2035 is OK: PROCS OK: 1 process with command name mysqld [17:49:04] RECOVERY - MariaDB Slave SQL: s2 on db2035 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:49:04] RECOVERY - MariaDB Slave IO: s2 on db2035 is OK: OK slave_io_state Slave_IO_Running: Yes [17:52:15] 06Operations, 10ops-codfw, 10DBA: db2035 reset - https://phabricator.wikimedia.org/T154189#2903726 (10akosiaris) [17:52:37] 06Operations, 10IDS-extension, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2903738 (10Shoichi) @awight @Niharika @MaxSem: Hi , my translation team started to work. Our work is in https://github.com/Wikimedia-TW/han3_ji7_tsoo1... [17:55:19] papaul: around ? [17:57:54] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 27.87 seconds [17:59:53] 06Operations, 10ops-codfw, 10DBA: db2035 reset - https://phabricator.wikimedia.org/T154189#2903739 (10akosiaris) FWIW, those times seem wrong. According to the OS, the reboot happened on 17:31 UTC and not 16:46, so the time was most probably wrong. That is also further validated by the following log line in... [18:02:14] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
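Two of the kubernetes puppet changes merged in this stretch, "Production kubernetes: Specify the service IP range" and "kubernetes apiserver: Allow specifying > 1 apiserver", correspond to standard kube-apiserver settings: the service IP range is the --service-cluster-ip-range flag, and running more than one apiserver commonly also involves --apiserver-count so the endpoints of the built-in kubernetes service are reconciled correctly (whether the puppet change uses that exact mechanism is not visible from the log). A purely illustrative command line, with invented values for the CIDR, etcd endpoint and count, and an --admission-control list echoing the earlier "Fix admission_control if clause" patch:

    # Illustrative only; the CIDR, etcd URL and count are made-up values, not the
    # production configuration.
    /usr/bin/kube-apiserver \
      --etcd-servers=https://etcd1001.example.org:2379 \
      --service-cluster-ip-range=10.192.64.0/18 \
      --apiserver-count=2 \
      --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota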
[18:03:14] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [18:23:34] 06Operations, 10ops-codfw, 10DBA: db2035 reset - https://phabricator.wikimedia.org/T154189#2903748 (10akosiaris) Logs indicate that the shutdown was via a graceful via USB keyboard and mouse being plugged in. @Papaul says it is possibly himself while working on db2034 on an independent issue and rebooting th... [18:48:44] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:11:26] _joe_: are you around by any chance ? [19:16:44] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [19:16:46] should be gone [19:20:44] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:34:44] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [19:43:44] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:48:44] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [20:01:44] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%) [20:01:54] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:11:44] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:13:34] <_joe_> matanya: I am now, what's up? [20:13:52] <_joe_> matanya: not for long though, unless it's an emergency [20:24:54] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:52:54] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [21:05:04] RECOVERY - check_raid on barium is OK: OK: MegaSAS 1 logical, 2 physical [21:53:18] _joe_: it was about requesting permission to add more workers to the video2commons project, if the video scalers catched up [21:53:23] but it can wait [22:01:04] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [22:06:24] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 55, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Transit: Init7 (donated) {#14009} [10Gbps]BR [22:07:24] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 [22:23:44] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:29:01] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904053 (10Paladox) [22:30:14] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [22:31:03] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904065 (10Paladox) p:05Triage>03Unbreak! Changing priority to break as someone needs to investigate it's impact on prod, as i managed to reproduce it in labs... [22:33:33] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904069 (10Paladox) This bug may have caused T153079 since that problem happened around the time after we upgraded to gerrit 2.13. [22:34:57] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904053 (10Paladox) the data loss includes merged changes [22:51:44] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [23:14:43] is anyone around with permissions on the fundraising server [23:14:58] eileen1: Probably not.. What's up? [23:15:01] there is a jenkins issue - looks like a not too hard fix with permissions..... [23:15:12] http://community.bonitasoft.com/hudson-jobs-missing-after-crash-restore-them-ashes [23:16:25] I'll try to phone jeff_green [23:16:31] I was just going to say the same thing :) [23:16:50] And/or David Strine if you need someone to coordindate [23:17:33] And even upto Katie if it's really blocking you :) [23:17:35] eileen1: ^ [23:18:07] Thanks - it's delayed fall out from a problem Jeff was dealing with [23:18:19] I was gonna say, I think he was about an hour or so ago [23:18:26] oh, 2 [23:18:44] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:19:18] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904128 (10Paladox) [23:23:28] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904130 (10Paladox) Reported here https://bugs.chromium.org/p/gerrit/issues/detail?id=5200 [23:29:47] I left a message on his phone [23:35:24] he is online now - thanks Reedy for the encouragement :-) [23:36:32] :) [23:36:37] Good good [23:47:44] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [23:59:44] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
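The late-evening fundraising Jenkins problem (jobs missing after a crash, with the linked bonitasoft article pointing at permissions) has a generic recovery pattern: if the job directories under $JENKINS_HOME survived the restore but ended up owned by the wrong user, Jenkins cannot load them at startup and they vanish from the UI. A rough sketch of what the person with access would check and fix; the path and the jenkins user/group are assumptions about a default layout, not specifics of the fundraising host:

    # Generic sketch, assuming a default $JENKINS_HOME of /var/lib/jenkins.
    ls -ld /var/lib/jenkins/jobs /var/lib/jenkins/jobs/*    # spot job dirs owned by the wrong user
    chown -R jenkins:jenkins /var/lib/jenkins/jobs          # hand them back to the user Jenkins runs as
    # Afterwards, "Reload Configuration from Disk" (or a service restart) should make
    # the restored jobs reappear.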