[00:48:24] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%) [02:23:24] PROBLEM - puppet last run on ms-fe1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:51:25] RECOVERY - puppet last run on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [02:55:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [02:55:59] I'll probably deploy ores soon-ish [02:56:15] unscheduled obviously: https://phabricator.wikimedia.org/T154168 [02:56:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [02:56:39] Just phab is not fast enough to catch up with gerrit [02:58:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:01:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [03:01:44] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [03:03:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:03:34] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:12:52] (03PS1) 10Ladsgroup: wikilabels: install nodejs package [puppet] - 10https://gerrit.wikimedia.org/r/329316 (https://phabricator.wikimedia.org/T154122) [03:24:24] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 722.70 seconds [03:29:24] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 173.69 seconds [03:29:44] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [04:09:04] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=307.10 Read Requests/Sec=275.80 Write Requests/Sec=0.50 KBytes Read/Sec=35266.40 KBytes_Written/Sec=4.40 [04:12:28] 06Operations: Two-factor authorisation reset request from user:Angelo.romano - https://phabricator.wikimedia.org/T154171#2903056 (10Peachey88) [04:17:04] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=13.40 Read Requests/Sec=0.50 Write Requests/Sec=8.20 KBytes Read/Sec=2.80 KBytes_Written/Sec=42.00 [04:26:24] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:54:24] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [05:06:06] !log starting deploy of ores:228b9b4 in canary nodes (T154168) [05:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:10] T154168: Quantity changes broke ORES - https://phabricator.wikimedia.org/T154168 [05:06:18] !log ladsgroup@tin Starting deploy [ores/deploy@228b9b4]: (no message) [05:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:37] !log starting deploy of ores:228b9b4 in all nodes (T154168) [05:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:40] T154168: Quantity changes broke ORES - https://phabricator.wikimedia.org/T154168 [05:25:05] !log ladsgroup@tin Finished deploy [ores/deploy@228b9b4]: (no message) (duration: 18m 46s) [05:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:27] !log finished deploy of ores:228b9b4 in all nodes (T154168) [05:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:30] T154168: Quantity changes broke ORES - https://phabricator.wikimedia.org/T154168 [05:44:10] !log ladsgroup@terbium:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=wikidatawiki (T154168) [05:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:13] T154168: Quantity changes broke ORES - https://phabricator.wikimedia.org/T154168 [05:50:01] 06Operations: Two-factor authorisation reset request from user:Angelo.romano - https://phabricator.wikimedia.org/T154171#2903017 (10Parent5446) Possible things to do: - [ ] Ensure the provided email matches the email of the account. Then send an email confirming the request. - [ ] Challenge him to provide his c... [05:55:16] 06Operations: Two-factor authorisation reset request from user:Angelo.romano - https://phabricator.wikimedia.org/T154171#2903136 (10Jalexander) a:03Jalexander I can look into this Tuesday if no one else is able to, once verified we can remove 2fa. Will send an email to the wiki users address now to start a cou... [05:58:42] 06Operations: Two-factor authorisation reset request from user:Angelo.romano - https://phabricator.wikimedia.org/T154171#2903139 (10Parent5446) @Jalexander is there a specific tag for these requests / can we make one? It would be useful for tracking to figure out how high priority T131789 and related tasks shoul... [06:25:24] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:25:34] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [06:31:44] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [06:53:34] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:55:24] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:35:51] (03PS1) 10ArielGlenn: media title dumps: use explicit path to list of wikis with globaluseagelist [puppet] - 10https://gerrit.wikimedia.org/r/329323 [07:41:51] (03CR) 10ArielGlenn: [C: 032] media title dumps: use explicit path to list of wikis with globaluseagelist [puppet] - 10https://gerrit.wikimedia.org/r/329323 (owner: 10ArielGlenn) [08:23:23] 06Operations, 10DBA, 07Chinese-Sites: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#2903280 (10Shizhao) [08:51:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [08:52:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [08:54:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:59:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [09:02:42] <_joe_> uhm [09:02:46] <_joe_> checking [09:03:33] <_joe_> a transient problem, it seems [09:03:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:04:00] <_joe_> it was 500s, so MediaWiki errors [09:07:34] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:09:04] (03PS1) 10Alexandros Kosiaris: k8s::apiserver: Leave authn out of authz if clause [puppet] - 10https://gerrit.wikimedia.org/r/329326 [10:15:56] (03CR) 10Alexandros Kosiaris: [C: 032] k8s::apiserver: Leave authn out of authz if clause [puppet] - 10https://gerrit.wikimedia.org/r/329326 (owner: 10Alexandros Kosiaris) [10:52:14] PROBLEM - Disk space on kubernetes1004 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/ab50d9422116c55331ef766093548bd565d407b679e13ec8276463b9540d470a/shm is not accessible: Permission denied [10:52:14] PROBLEM - Disk space on kubernetes1002 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/386db64ef8c000727e583d75837ff32fa95468bfb681230bb6fe0b630ae0dd79/shm is not accessible: Permission denied [10:52:50] <_joe_> ? [10:53:17] <_joe_> akosiaris: we might want to change check_disk on those machines [10:54:27] _joe_: just running my very first pods [10:54:38] <_joe_> yeah I figured [10:54:38] finally, after some PEBKACs last week [10:54:44] <_joe_> \o/ [10:56:26] funny how mount shows /var/lib/docker/devicemapper/mnt/9f80c2485e582ff1e551f374a62a842eca33e76e24bc2fc5b23a2019baa993e1 but df doesn't [10:56:30] xfs ? [10:56:34] PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:56:50] why on earth are we using XFS there ? 
not that it matters much right now [10:57:09] but when we go to production I'd like to not have machines lock up due to xfs issues [11:18:02] <_joe_> akosiaris: ask docker [11:18:25] <_joe_> we did just create the lvm volumes [11:20:58] yeah docker info says XFS [11:21:12] anyway, will figure out at some other point. making a note for now [11:24:34] RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [12:04:31] (03PS1) 10Tim Landscheidt: postgresql: Only set user password if different [puppet] - 10https://gerrit.wikimedia.org/r/329328 [12:09:33] (03CR) 10Tim Landscheidt: "Tested with a) user not existing (which causes pass_set to be triggered by create_user) and b) password changed ("ALTER ROLE replication P" [puppet] - 10https://gerrit.wikimedia.org/r/329328 (owner: 10Tim Landscheidt) [12:23:44] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:30:57] (03PS1) 10Tim Landscheidt: puppetdb: Do not set up Ganglia in Labs [puppet] - 10https://gerrit.wikimedia.org/r/329329 (https://phabricator.wikimedia.org/T154104) [12:31:17] 07Puppet, 06Labs, 10Tool-Labs, 13Patch-For-Review: role::puppetmaster::puppetdb depends on Ganglia and cannot be used in Labs - https://phabricator.wikimedia.org/T154104#2903464 (10scfc) a:03scfc [12:48:41] (03PS1) 10Tim Landscheidt: puppetdb: Do not hardcode puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/329330 (https://phabricator.wikimedia.org/T153577) [12:51:39] (03CR) 10Tim Landscheidt: "Tested the inline_template() call and it is 1:1 the existing ferm rule for the production Hiera data; yet, of course, this requires a bit " [puppet] - 10https://gerrit.wikimedia.org/r/329330 (https://phabricator.wikimedia.org/T153577) (owner: 10Tim Landscheidt) [12:51:44] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:54:41] 07Puppet, 06Labs, 10Tool-Labs, 13Patch-For-Review: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#2903500 (10scfc) a:03scfc [13:24:34] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [13:27:54] PROBLEM - Check systemd state on kubernetes1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:28:54] RECOVERY - Check systemd state on kubernetes1001 is OK: OK - running: The system is fully operational [13:35:16] <_joe_> uhm [13:35:26] <_joe_> what happened there akosiaris ?
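The kubernetes100x disk alerts earlier in the hour ("DISK CRITICAL - /var/lib/docker/containers/.../shm is not accessible: Permission denied") come from check_disk stumbling over docker's per-container mounts rather than from a genuinely full filesystem, which is why _joe_ suggests changing check_disk on those machines; patches doing exactly that ("role::kubernetes::worker: tweak disk checks" and "Add /run/docker/netns/ as well in ignored disk checks") land later in this log. As a rough illustration only, a check_disk invocation that skips those mounts could look like the sketch below; the thresholds and exact ignore patterns are assumptions, not the command production actually runs:

    # Minimal sketch, not the real Icinga/NRPE command: skip docker-managed mount
    # points, whose per-container shm and netns entries the nagios user cannot stat.
    /usr/lib/nagios/plugins/check_disk -w 6% -c 3% -l \
      --ignore-ereg-path='^/var/lib/docker/' \
      --ignore-ereg-path='^/run/docker/netns/'

--ignore-ereg-path can be repeated, so further docker paths (such as the netns one added later in the log) can be appended without restructuring the check.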
[13:45:05] (03PS1) 10Alexandros Kosiaris: kubernetes: Instruct docker to not handle iptables [puppet] - 10https://gerrit.wikimedia.org/r/329333 [13:45:07] (03PS1) 10Alexandros Kosiaris: Force docker bridge IP address per host [puppet] - 10https://gerrit.wikimedia.org/r/329334 [13:49:51] (03CR) 10Giuseppe Lavagetto: Force docker bridge IP address per host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/329334 (owner: 10Alexandros Kosiaris) [13:50:49] (03PS2) 10Alexandros Kosiaris: Force docker bridge IP address per host [puppet] - 10https://gerrit.wikimedia.org/r/329334 [13:52:34] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [13:54:09] (03PS1) 10Alexandros Kosiaris: kubernetes::apiserver: Fix admission_control if clause [puppet] - 10https://gerrit.wikimedia.org/r/329337 [13:55:48] (03CR) 10Alexandros Kosiaris: "Addressed Giuseppe comment (that answer was "by mistake"). PCC ran successfully and with expected output at https://puppet-compiler.wmflab" [puppet] - 10https://gerrit.wikimedia.org/r/329334 (owner: 10Alexandros Kosiaris) [13:56:29] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes: Instruct docker to not handle iptables [puppet] - 10https://gerrit.wikimedia.org/r/329333 (owner: 10Alexandros Kosiaris) [13:56:47] (03CR) 10Alexandros Kosiaris: [C: 032] Force docker bridge IP address per host [puppet] - 10https://gerrit.wikimedia.org/r/329334 (owner: 10Alexandros Kosiaris) [13:57:17] _joe_: me fighting with docker. Part of the problems fixed in patches above [13:57:44] <_joe_> cool [13:58:27] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes::apiserver: Fix admission_control if clause [puppet] - 10https://gerrit.wikimedia.org/r/329337 (owner: 10Alexandros Kosiaris) [14:01:24] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK [14:01:24] RECOVERY - Disk space on kubernetes1002 is OK: DISK OK [14:01:34] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 seconds ago with 1 failures. Failed resources (up to 3 shown): Service[docker] [14:02:34] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [14:07:24] PROBLEM - Disk space on kubernetes1004 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/a66cf08a28cba21908040c2259587c0aa39fe25a2e5fff452b5d7face8c9f33e/shm is not accessible: Permission denied [14:09:54] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:12:24] PROBLEM - Disk space on kubernetes1003 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/10f7cd0ee5b65452e14ef7799cdbb1243f38c5ddff530f0262b90150d29dea28/shm is not accessible: Permission denied [14:22:26] (03PS1) 10Giuseppe Lavagetto: openstack::horizon::service: use require_package for keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/329339 [14:37:54] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:41:54] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:56:08] (03CR) 10Giuseppe Lavagetto: [C: 032] openstack::horizon::service: use require_package for keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/329339 (owner: 10Giuseppe Lavagetto) [15:09:24] PROBLEM - Host barium is DOWN: PING CRITICAL - Packet loss = 100% [15:10:17] <_joe_> I guess this is Jeff n Chris? [15:10:54] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:11:14] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 105, down: 1, dormant: 0, excluded: 2, unused: 0BRge-2/0/7: down - bariumBR [15:12:14] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 107, down: 0, dormant: 0, excluded: 2, unused: 0 [15:12:47] ACKNOWLEDGEMENT - Host barium is DOWN: PING CRITICAL - Packet loss = 100% Jeff_Green hard disk replacement [15:13:06] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:19:24] RECOVERY - Disk space on kubernetes1003 is OK: DISK OK [15:21:24] PROBLEM - Disk space on kubernetes1002 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/f1fcdc1e23d40f3ccad307fd08a9c6420018ec2432a46f6afd7c96a4b984870e/shm is not accessible: Permission denied [15:24:24] PROBLEM - Disk space on kubernetes1003 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/6ba7762c0da40811ad17511e241197114c34a63ebce3905c71673f5581d4cb1c/shm is not accessible: Permission denied [15:37:14] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 105, down: 1, dormant: 0, excluded: 2, unused: 0BRge-2/0/7: down - bariumBR [15:39:14] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 107, down: 0, dormant: 0, excluded: 2, unused: 0 [15:43:24] RECOVERY - Disk space on kubernetes1002 is OK: DISK OK [15:55:01] (03Abandoned) 10Giuseppe Lavagetto: New release [debs/pybal] - 10https://gerrit.wikimedia.org/r/268629 (owner: 10Giuseppe Lavagetto) [15:58:14] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 105, down: 1, dormant: 0, excluded: 2, unused: 0BRge-2/0/7: down - bariumBR [15:59:44] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[docker-engine] [15:59:44] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[docker-engine] [15:59:44] PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[docker-engine] [15:59:45] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 13 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[docker-engine] [16:00:05] (03PS1) 10Alexandros Kosiaris: kubernetes::worker: Bump docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329344 [16:00:24] PROBLEM - Disk space on kubernetes1002 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/c33e528ee164b21156a0d315f523f84a0bfe8cb45217d7572de00c407a2d6533/shm is not accessible: Permission denied [16:01:14] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 107, down: 0, dormant: 0, excluded: 2, unused: 0 [16:04:08] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes::worker: Bump docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329344 (owner: 10Alexandros Kosiaris) [16:04:15] (03PS2) 10Alexandros Kosiaris: kubernetes::worker: Bump docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329344 [16:04:18] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes::worker: Bump docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329344 (owner: 10Alexandros Kosiaris) [16:07:59] (03PS1) 10Giuseppe Lavagetto: role::kubernetes::worker: tweak disk checks [puppet] - 10https://gerrit.wikimedia.org/r/329345 [16:08:03] <_joe_> akosiaris: ^^ [16:11:51] (03PS1) 10Alexandros Kosiaris: kubernetes: Specify correctly the docker version [puppet] - 10https://gerrit.wikimedia.org/r/329346 [16:12:03] (03CR) 10Alexandros Kosiaris: [C: 032] role::kubernetes::worker: tweak disk checks [puppet] - 10https://gerrit.wikimedia.org/r/329345 (owner: 10Giuseppe Lavagetto) [16:12:08] (03PS2) 10Alexandros Kosiaris: role::kubernetes::worker: tweak disk checks [puppet] - 10https://gerrit.wikimedia.org/r/329345 (owner: 10Giuseppe Lavagetto) [16:12:13] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] role::kubernetes::worker: tweak disk checks [puppet] - 10https://gerrit.wikimedia.org/r/329345 (owner: 10Giuseppe Lavagetto) [16:12:36] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes: Specify correctly the docker version [puppet] - 10https://gerrit.wikimedia.org/r/329346 (owner: 10Alexandros Kosiaris) [16:12:44] (03PS2) 10Alexandros Kosiaris: kubernetes: Specify correctly the docker version [puppet] - 10https://gerrit.wikimedia.org/r/329346 [16:12:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes: Specify correctly the docker version [puppet] - 10https://gerrit.wikimedia.org/r/329346 (owner: 10Alexandros Kosiaris) [16:14:44] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:14:44] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:14:44] RECOVERY - puppet last run on kubernetes1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:14:54] RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:17:05] (03PS1) 10Alexandros Kosiaris: kubernetes: Add /run/docker/netns/ as well in ignored disk checks [puppet] - 10https://gerrit.wikimedia.org/r/329347 [16:23:50] <_joe_> !log power down prometheus2003 [16:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:24] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
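The Package[docker-engine] puppet failures on kubernetes100[1-4] above, together with the follow-up patches "Bump docker-engine version" and "Specify correctly the docker version", look like the usual symptom of a Package resource pinned to an exact version that the configured apt repository does not offer: apt cannot find that version, so the resource fails until the pin is corrected. A hedged sketch of how one would confirm and fix that by hand; the version string here is hypothetical, not the one the puppet change actually pins:

    # Sketch only; "1.12.5-0~debian-jessie" is an invented example version.
    apt-cache policy docker-engine                              # show which versions the repos actually offer
    apt-get install -y docker-engine=1.12.5-0~debian-jessie     # install one exact, pinned version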
[16:31:56] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes: Add /run/docker/netns/ as well in ignored disk checks [puppet] - 10https://gerrit.wikimedia.org/r/329347 (owner: 10Alexandros Kosiaris) [16:32:14] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [16:32:40] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2903669 (10zhuyifei1999) [16:35:27] 06Operations, 10ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2903671 (10Joe) My suggestion would be that these 10 new systems should replace mw2075 - mw2090 functionally, and specifically: - **3 servers** to replace the 5 API appservers mw2075-79 so in **row A** - **... [16:35:43] 06Operations: reinstall/reimage sinistra as mwlog2001 - https://phabricator.wikimedia.org/T153384#2903674 (10Papaul) [16:35:46] 06Operations, 10ops-codfw: update label/racktables visible label for mwlog2001 (was sinistra) - https://phabricator.wikimedia.org/T153771#2903672 (10Papaul) 05Open>03Resolved Complete [16:35:50] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2903676 (10zhuyifei1999) [[https://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&c=Video+scalers+eqiad&h=&tab=m&v... [16:36:09] 06Operations, 10ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2903678 (10Joe) a:05Joe>03RobH [16:36:24] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK [16:39:24] RECOVERY - Disk space on kubernetes1003 is OK: DISK OK [16:42:24] RECOVERY - Disk space on kubernetes1002 is OK: DISK OK [16:44:28] 06Operations, 10ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2857513 (10Papaul) @joe are those systems already decommissioned? [16:45:14] RECOVERY - Host barium is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [16:46:04] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:59:41] (03PS1) 10Alexandros Kosiaris: Production kubernetes: Specify the service IP range [puppet] - 10https://gerrit.wikimedia.org/r/329350 [17:06:53] (03PS1) 10Alexandros Kosiaris: kubernetes apiserver: Allow specifying > 1 apiserver [puppet] - 10https://gerrit.wikimedia.org/r/329351 [17:10:51] (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/4999/ says diff is fine, merging into tool labs first to ensure no breakage" [puppet] - 10https://gerrit.wikimedia.org/r/329351 (owner: 10Alexandros Kosiaris) [17:14:19] (03CR) 10Alexandros Kosiaris: "cherry picked in tool labs, was a noop, merging" [puppet] - 10https://gerrit.wikimedia.org/r/329351 (owner: 10Alexandros Kosiaris) [17:14:28] (03CR) 10Alexandros Kosiaris: [C: 032] Production kubernetes: Specify the service IP range [puppet] - 10https://gerrit.wikimedia.org/r/329350 (owner: 10Alexandros Kosiaris) [17:14:31] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes apiserver: Allow specifying > 1 apiserver [puppet] - 10https://gerrit.wikimedia.org/r/329351 (owner: 10Alexandros Kosiaris) [17:15:04] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:28:47] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2903705 (10Papaul) 05Open>03Resolved IDRAC card extention replacement complete. The system is back up and using the the dedicated card. [17:30:44] PROBLEM - Host db2035 is DOWN: PING CRITICAL - Packet loss = 100% [17:32:14] RECOVERY - Host db2035 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [17:35:09] PROBLEM - mysqld processes on db2035 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [17:35:09] PROBLEM - MariaDB Slave SQL: s2 on db2035 is CRITICAL: CRITICAL slave_sql_state could not connect [17:35:09] PROBLEM - MariaDB Slave IO: s2 on db2035 is CRITICAL: CRITICAL slave_io_state could not connect [17:41:54] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:47:08] RECOVERY - mysqld processes on db2035 is OK: PROCS OK: 1 process with command name mysqld [17:49:04] RECOVERY - MariaDB Slave SQL: s2 on db2035 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:49:04] RECOVERY - MariaDB Slave IO: s2 on db2035 is OK: OK slave_io_state Slave_IO_Running: Yes [17:52:15] 06Operations, 10ops-codfw, 10DBA: db2035 reset - https://phabricator.wikimedia.org/T154189#2903726 (10akosiaris) [17:52:37] 06Operations, 10IDS-extension, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2903738 (10Shoichi) @awight @Niharika @MaxSem: Hi , my translation team started to work. Our work is in https://github.com/Wikimedia-TW/han3_ji7_tsoo1... [17:55:19] papaul: around ? [17:57:54] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 27.87 seconds [17:59:53] 06Operations, 10ops-codfw, 10DBA: db2035 reset - https://phabricator.wikimedia.org/T154189#2903739 (10akosiaris) FWIW, those times seem wrong. According to the OS, the reboot happened on 17:31 UTC and not 16:46, so the time was most probably wrong. That is also further validated by the following log line in... [18:02:14] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
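Two of the kubernetes puppet changes merged in this stretch, "Production kubernetes: Specify the service IP range" and "kubernetes apiserver: Allow specifying > 1 apiserver", correspond to standard kube-apiserver settings: the service IP range is the --service-cluster-ip-range flag, and running more than one apiserver commonly also involves --apiserver-count so the endpoints of the built-in kubernetes service are reconciled correctly (whether the puppet change uses that exact mechanism is not visible from the log). A purely illustrative command line, with invented values for the CIDR, etcd endpoint and count, and an --admission-control list echoing the earlier "Fix admission_control if clause" patch:

    # Illustrative only; the CIDR, etcd URL and count are made-up values, not the
    # production configuration.
    /usr/bin/kube-apiserver \
      --etcd-servers=https://etcd1001.example.org:2379 \
      --service-cluster-ip-range=10.192.64.0/18 \
      --apiserver-count=2 \
      --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota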
[18:03:14] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [18:23:34] 06Operations, 10ops-codfw, 10DBA: db2035 reset - https://phabricator.wikimedia.org/T154189#2903748 (10akosiaris) Logs indicate that the shutdown was via a graceful via USB keyboard and mouse being plugged in. @Papaul says it is possibly himself while working on db2034 on an independent issue and rebooting th... [18:48:44] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:11:26] _joe_: are you around by any chance ? [19:16:44] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [19:16:46] should be gone [19:20:44] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:34:44] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [19:43:44] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:48:44] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [20:01:44] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%) [20:01:54] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:11:44] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:13:34] <_joe_> matanya: I am now, what's up? [20:13:52] <_joe_> matanya: not for long though, unless it's an emergency [20:24:54] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:52:54] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [21:05:04] RECOVERY - check_raid on barium is OK: OK: MegaSAS 1 logical, 2 physical [21:53:18] _joe_: it was about requesting permission to add more workers to the video2commons project, if the video scalers catched up [21:53:23] but it can wait [22:01:04] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [22:06:24] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 55, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Transit: Init7 (donated) {#14009} [10Gbps]BR [22:07:24] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 [22:23:44] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:29:01] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904053 (10Paladox) [22:30:14] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [22:31:03] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904065 (10Paladox) p:05Triage>03Unbreak! Changing priority to break as someone needs to investigate it's impact on prod, as i managed to reproduce it in labs... [22:33:33] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904069 (10Paladox) This bug may have caused T153079 since that problem happened around the time after we upgraded to gerrit 2.13. [22:34:57] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904053 (10Paladox) the data loss includes merged changes [22:51:44] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [23:14:43] is anyone around with permissions on the fundraising server [23:14:58] eileen1: Probably not.. What's up? [23:15:01] there is a jenkins issue - looks like a not too hard fix with permissions..... [23:15:12] http://community.bonitasoft.com/hudson-jobs-missing-after-crash-restore-them-ashes [23:16:25] I'll try to phone jeff_green [23:16:31] I was just going to say the same thing :) [23:16:50] And/or David Strine if you need someone to coordindate [23:17:33] And even upto Katie if it's really blocking you :) [23:17:35] eileen1: ^ [23:18:07] Thanks - it's delayed fall out from a problem Jeff was dealing with [23:18:19] I was gonna say, I think he was about an hour or so ago [23:18:26] oh, 2 [23:18:44] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:19:18] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904128 (10Paladox) [23:23:28] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss - https://phabricator.wikimedia.org/T154205#2904130 (10Paladox) Reported here https://bugs.chromium.org/p/gerrit/issues/detail?id=5200 [23:29:47] I left a message on his phone [23:35:24] he is online now - thanks Reedy for the encouragement :-) [23:36:32] :) [23:36:37] Good good [23:47:44] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [23:59:44] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
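The late-evening fundraising Jenkins problem (jobs missing after a crash, with the linked bonitasoft article pointing at permissions) has a generic recovery pattern: if the job directories under $JENKINS_HOME survived the restore but ended up owned by the wrong user, Jenkins cannot load them at startup and they vanish from the UI. A rough sketch of what the person with access would check and fix; the path and the jenkins user/group are assumptions about a default layout, not specifics of the fundraising host:

    # Generic sketch, assuming a default $JENKINS_HOME of /var/lib/jenkins.
    ls -ld /var/lib/jenkins/jobs /var/lib/jenkins/jobs/*    # spot job dirs owned by the wrong user
    chown -R jenkins:jenkins /var/lib/jenkins/jobs          # hand them back to the user Jenkins runs as
    # Afterwards, "Reload Configuration from Disk" (or a service restart) should make
    # the restored jobs reappear.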