[00:04:43] Operations, Phabricator, Traffic, Release-Engineering-Team (Kanban), User-greg: Please create a phame blog for surveys and other type of aggregated analysis - https://phabricator.wikimedia.org/T190598#4077813 (mcruzWMF)
[00:12:37] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:14:37] Operations, Phabricator, Traffic, Release-Engineering-Team (Kanban), User-greg: Please create a phame blog for surveys and other type of aggregated analysis - https://phabricator.wikimedia.org/T190598#4077825 (mcruzWMF) p:Triage>High
[00:16:41] (PS4) Rduran: Create tests skeleton [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/420746
[00:16:43] (PS3) Rduran: [WIP] Refactor and test the main OSC run method [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/421340
[00:19:03] Operations, Phabricator, Traffic, Release-Engineering-Team (Kanban), User-greg: Please create a phame blog for surveys and other type of aggregated analysis - https://phabricator.wikimedia.org/T190598#4077813 (Paladox) I don't think this needs to be tagged as #operations or #traffic. cc @...
[00:20:45] (Abandoned) Rduran: Add flake8 config and requirement [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/420015 (owner: Rduran)
[00:21:39] (PS1) Alexandros Kosiaris: Add wmfdebug image [docker-images/production-images] - https://gerrit.wikimedia.org/r/421680
[00:39:34] !log Correct retention rules for Whisper files on graphite2001 and graphite1001 per T179622 (/var/lib/carbon/whisper/mw/*)
[00:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:42] T179622: Update our Graphite metrics for current retention rules - https://phabricator.wikimedia.org/T179622
[00:55:58] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:27] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:01:28] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:01:37] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:01:37] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:01:38] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:01:38] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:01:47] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:01:47] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:01:48] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:01:48] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:01:57] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:01:58] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:01:58] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:01:58] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:02:07] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:07] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:02:07] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:08] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:02:17] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:02:17] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:17] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:18] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:18] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:27] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:27] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:28] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:28] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:02:37] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:37] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 63 not-conn: cp3048_v4, cp3048_v6 no-xfrm: cp3044_v4
[01:02:37] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:02:38] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:02:38] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:02:47] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:27:03] !log Correct retention rules for Whisper files on graphite2001 and graphite1001 per T179622 (/var/lib/carbon/whisper/VisualEditor/*)
[01:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:09] T179622: Update our Graphite metrics for current retention config - https://phabricator.wikimedia.org/T179622
[01:55:05] Operations, Cloud-Services, DC-Ops, hardware-requests, Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921#4077947 (Krinkle)
[01:55:19] Operations, Cloud-Services, DC-Ops, hardware-requests, Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921#4058041 (Krinkle)
[02:30:38] (CR) Liuxinyu970226: [C: 1] Initial configuration for lfnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/400234 (https://phabricator.wikimedia.org/T183561) (owner: Urbanecm)
[02:31:56] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp3048.esams.wmnet
[02:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:28] (CR) Liuxinyu970226: [C: 1] Initial configuration for inhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374) (owner: Urbanecm)
[02:33:03] !log powercycle cp3048
[02:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:18] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 66 ESP OK
[02:35:27] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK
[02:35:27] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.86 ms
[02:35:27] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK
[02:35:28] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK
[02:35:28] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 92 ESP OK
[02:35:37] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK
[02:35:37] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK
[02:35:37] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK
[02:35:37] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 92 ESP OK
[02:35:38] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK
[02:35:47] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK
[02:35:47] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK
[02:35:47] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK
[02:35:47] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK
[02:35:47] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 66 ESP OK
[02:35:57] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK
[02:35:57] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK
[02:35:57] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 92 ESP OK
[02:35:57] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 92 ESP OK
[02:35:58] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 66 ESP OK
[02:35:58] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 66 ESP OK
[02:35:58] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK
[02:35:59] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 66 ESP OK
[02:35:59] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 92 ESP OK
[02:36:07] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 92 ESP OK
[02:36:07] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 92 ESP OK
[02:36:07] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 92 ESP OK
[02:36:07] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 92 ESP OK
[02:36:08] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 92 ESP OK
[02:36:08] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 66 ESP OK
[02:36:17] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 66 ESP OK
[02:36:18] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK
[02:36:18] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK
[02:49:08] Operations, DNS, Release-Engineering-Team, Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4077990 (Liuxinyu970226)
[02:52:19] Operations, ops-esams, Traffic: cp3048 hardware issues - https://phabricator.wikimedia.org/T190607#4078006 (BBlack) p:Triage>Normal
[03:14:37] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:26:07] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 807.67 seconds
[03:28:05] Critical Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%
[03:28:09] Critical Alert for device cr2-eqiad.wikimedia.org - Primary inbound port utilisation over 80%
[03:28:58] ^ i think that may accidentally be me, transferring data from codfw elasticsearch to eqiad hadoop. It's going back down now
[03:30:45] network on the codfw es cluster at the time was 2.4GBps tx, but i think only half of that was going to eqiad, and the other half was intra-cluster traffic
[03:42:05] Critical Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80%
[03:43:05] Critical Device cr2-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80%
[03:44:37] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[03:55:08] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 255.66 seconds
[05:49:27] PROBLEM - High CPU load on API appserver on mw1289 is CRITICAL: CRITICAL - load average: 91.00, 26.41, 14.96
[05:50:28] PROBLEM - Disk space on Hadoop worker on analytics1051 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/m 25 GB (0% inode=99%): /var/lib/hadoop/data/g 24 GB (0% inode=99%): /var/lib/hadoop/data/l 21 GB (0% inode=99%): /var/lib/hadoop/data/d 24 GB (0% inode=99%): /var/lib/hadoop/data/e 23 GB (0% inode=99%): /var/lib/hadoop/data/c 18 GB (0% inode=99%): /var/lib/hadoop/data/h 25 GB (0% inode=99%): /var/lib/hadoop/data
[05:50:28] 99%): /var/lib/hadoop/data/j 16 GB (0% inode=99%): /var/lib/hadoop/data/b 24 GB (0% inode=99%): /var/lib/hadoop/data/f 22 GB (0% inode=99%): /var/lib/hadoop/data/i 24 GB (0% inode=99%)
[05:51:17] RECOVERY - High CPU load on API appserver on mw1289 is OK: OK - load average: 22.40, 21.19, 14.35
[06:01:27] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 61203 MB (12% inode=99%)
[06:08:08] PROBLEM - Disk space on Hadoop worker on analytics1046 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 20 GB (0% inode=99%): /var/lib/hadoop/data/k 26 GB (0% inode=99%): /var/lib/hadoop/data/f 25 GB (0% inode=99%): /var/lib/hadoop/data/c 16 GB (0% inode=99%): /var/lib/hadoop/data/j 21 GB (0% inode=99%): /var/lib/hadoop/data/g 19 GB (0% inode=99%): /var/lib/hadoop/data/d 19 GB (0% inode=99%): /var/lib/hadoop/data
[06:08:08] 99%): /var/lib/hadoop/data/h 19 GB (0% inode=99%): /var/lib/hadoop/data/b 23 GB (0% inode=99%): /var/lib/hadoop/data/i 24 GB (0% inode=99%): /var/lib/hadoop/data/m 22 GB (0% inode=99%)
[06:10:28] RECOVERY - Disk space on elastic1027 is OK: DISK OK
[06:22:48] PROBLEM - Disk space on Hadoop worker on analytics1050 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 22 GB (0% inode=99%): /var/lib/hadoop/data/e 23 GB (0% inode=99%): /var/lib/hadoop/data/k 22 GB (0% inode=99%): /var/lib/hadoop/data/c 20 GB (0% inode=99%): /var/lib/hadoop/data/g 21 GB (0% inode=99%): /var/lib/hadoop/data/j 22 GB (0% inode=99%): /var/lib/hadoop/data/f 24 GB (0% inode=99%): /var/lib/hadoop/data
[06:22:48] 99%): /var/lib/hadoop/data/d 19 GB (0% inode=99%): /var/lib/hadoop/data/m 18 GB (0% inode=99%): /var/lib/hadoop/data/i 23 GB (0% inode=99%): /var/lib/hadoop/data/b 16 GB (0% inode=99%)
[06:27:48] PROBLEM - Disk space on Hadoop worker on analytics1050 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 22 GB (0% inode=99%): /var/lib/hadoop/data/e 23 GB (0% inode=99%): /var/lib/hadoop/data/k 21 GB (0% inode=99%): /var/lib/hadoop/data/c 19 GB (0% inode=99%): /var/lib/hadoop/data/g 21 GB (0% inode=99%): /var/lib/hadoop/data/j 22 GB (0% inode=99%): /var/lib/hadoop/data/f 25 GB (0% inode=99%): /var/lib/hadoop/data
[06:27:48] 99%): /var/lib/hadoop/data/d 19 GB (0% inode=99%): /var/lib/hadoop/data/m 18 GB (0% inode=99%): /var/lib/hadoop/data/i 22 GB (0% inode=99%): /var/lib/hadoop/data/b 16 GB (0% inode=99%)
[07:12:23] Operations, Analytics-Kanban, Patch-For-Review, User-Elukey, User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#4078152 (Joe)
[07:51:12] (PS8) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[07:59:15] (PS9) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[07:59:31] (CR) jerkins-bot: [V: -1] Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657) (owner: ArielGlenn)
[08:00:57] PROBLEM - Disk space on Hadoop worker on analytics1050 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 22 GB (0% inode=99%): /var/lib/hadoop/data/e 21 GB (0% inode=99%): /var/lib/hadoop/data/k 19 GB (0% inode=99%): /var/lib/hadoop/data/c 18 GB (0% inode=99%): /var/lib/hadoop/data/g 21 GB (0% inode=99%): /var/lib/hadoop/data/j 19 GB (0% inode=99%): /var/lib/hadoop/data/f 23 GB (0% inode=99%): /var/lib/hadoop/data
[08:00:58] 99%): /var/lib/hadoop/data/d 17 GB (0% inode=99%): /var/lib/hadoop/data/m 17 GB (0% inode=99%): /var/lib/hadoop/data/i 20 GB (0% inode=99%): /var/lib/hadoop/data/b 16 GB (0% inode=99%)
[08:03:38] (PS10) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[08:07:51] (PS11) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[08:12:13] (PS12) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[08:12:48] RECOVERY - Disk space on Hadoop worker on analytics1051 is OK: DISK OK
[08:12:58] RECOVERY - Disk space on Hadoop worker on analytics1050 is OK: DISK OK
[08:13:27] RECOVERY - Disk space on Hadoop worker on analytics1046 is OK: DISK OK
[08:16:59] (PS13) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[08:37:37] (PS14) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[09:02:14] (PS15) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[09:14:30] (CR) ArielGlenn: "This is a no-op for ferm rules and rsync config, see https://puppet-compiler.wmflabs.org/compiler03/10644/" [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657) (owner: ArielGlenn)
[09:19:27] (PS16) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[09:20:09] (PS1) Daimona Eaytoy: Change wording for block durations [mediawiki-config] - https://gerrit.wikimedia.org/r/421691 (https://phabricator.wikimedia.org/T190602)
[09:52:50] (PS2) Daimona Eaytoy: Change wording for AbuseFilter global block durations [mediawiki-config] - https://gerrit.wikimedia.org/r/421691 (https://phabricator.wikimedia.org/T190602)
[14:00:18] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 61574 MB (12% inode=99%)
[14:07:07] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received
[14:08:07] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[14:10:07] PROBLEM - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.168 and port 9042: Connection refused
[14:11:08] PROBLEM - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[14:12:18] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[14:12:57] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:20:18] RECOVERY - Disk space on elastic1025 is OK: DISK OK
[14:32:57] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[14:33:18] RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cassandra-a is active
[14:34:17] RECOVERY - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-a valid until 2018-07-20 15:08:10 +0000 (expires in 118 days)
[14:34:58] RECOVERY - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.168 port 9042
[14:38:19] (PS1) Andrew Bogott: labweb oauth: use labtestwikitech, not prod wikitech [puppet] - https://gerrit.wikimedia.org/r/421702 (https://phabricator.wikimedia.org/T156276)
[14:41:26] (CR) Andrew Bogott: [C: 2] labweb oauth: use labtestwikitech, not prod wikitech [puppet] - https://gerrit.wikimedia.org/r/421702 (https://phabricator.wikimedia.org/T156276) (owner: Andrew Bogott)
[14:59:27] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:00:29] !log rm -rf /srv/mediawiki/core on stat100[456] and force puppet run (git pull returned fatal: protocol error: bad pack header)
[15:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:07] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[15:03:07] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[15:50:27] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 60930 MB (12% inode=99%)
[15:52:27] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 60366 MB (12% inode=99%)
[16:08:27] RECOVERY - Disk space on elastic1025 is OK: DISK OK
[16:32:55] (PS1) Alex Monk: openstack: Permit deployment-prep-dns-manager to log in from instance subnet [puppet] - https://gerrit.wikimedia.org/r/421709 (https://phabricator.wikimedia.org/T182927)
[18:27:29] PROBLEM - nova-compute proc maximum on labvirt1012 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[18:28:28] RECOVERY - nova-compute proc maximum on labvirt1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[19:45:38] (PS1) MarcoAurelio: Add 'tboverride' to 'engineer' at ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/421721 (https://phabricator.wikimedia.org/T190619)
[20:22:15] !log rm 2fa from Awight@officewiki
[20:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:17] PROBLEM - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[20:39:28] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:39:28] PROBLEM - cassandra-b service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[20:39:57] PROBLEM - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.169 and port 9042: Connection refused
[20:49:22] hello 118
[21:04:28] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[21:04:28] RECOVERY - cassandra-b service on restbase-dev1006 is OK: OK - cassandra-b is active
[21:05:17] RECOVERY - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-b valid until 2018-07-20 15:08:11 +0000 (expires in 117 days)
[21:05:57] RECOVERY - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.169 port 9042
[21:24:52] Puppet, Beta-Cluster-Infrastructure: PROBLEM - Puppet errors on deployment-mediawiki07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] - https://phabricator.wikimedia.org/T190632#4078788 (MarcoAurelio)
[21:26:51] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#4078802 (MarcoAurelio)
[21:26:55] Puppet, Beta-Cluster-Infrastructure: PROBLEM - Puppet errors on deployment-mediawiki07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] - https://phabricator.wikimedia.org/T190632#4078801 (MarcoAurelio)
[21:27:39] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2192875 (MarcoAurelio)
[21:28:23] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2192875 (MarcoAurelio)
[21:28:25] Operations, Puppet, Beta-Cluster-Infrastructure: Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76) - https://phabricator.wikimedia.org/T187736#4078807 (MarcoAurelio)
[21:38:28] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2192875 (MarcoAurelio) deployment-videoscaler01 seems to no longer exist? ``` $ ssh -a deployment-videoscaler01 channel 0: open failed: connect failed: No route...
[22:01:16] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#4078843 (MarcoAurelio) @Andrew et al. Some docs on Wikitech on usual puppet errors and how to fix them would IMHO help. I feel some of us who have access to de...
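
For readers following the cp3048 incident above (host DOWN at 00:55:58, depool logged at 02:31:56, powercycle at 02:33:03, recoveries from 02:35:18): the depool entry in the SAL is a conftool action. Below is a minimal sketch of what that depool, and an eventual repool, look like with confctl; the hostname and selector are taken from the log, while the repool step is an assumption and does not appear in this excerpt.

```
# Depool cp3048 from its services (matches the conftool action logged at 02:31:56).
confctl select 'name=cp3048.esams.wmnet' set/pooled=no

# ...power-cycle the host and wait for it and its IPsec tunnels to recover...

# Repool once healthy again (assumed follow-up; not shown in this log).
confctl select 'name=cp3048.esams.wmnet' set/pooled=yes
```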
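Similarly, the SAL entries at 00:39:34 and 01:27:03 record retention corrections for existing Whisper files under /var/lib/carbon/whisper/. The log does not show the command that was used; as an illustrative sketch only, existing .wsp files are typically rewritten to match new retention rules with the whisper-resize tool from Graphite's whisper package. The retention values below are placeholders, not the ones from T179622.

```
# Hypothetical sketch: resize every Whisper file under mw/ to new retention archives.
# The archive definitions here are placeholders; the real ones are set in storage-schemas.conf.
find /var/lib/carbon/whisper/mw/ -name '*.wsp' \
  -exec whisper-resize.py --nobackup {} 1m:7d 5m:30d 1h:1y \;
```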