[00:04:43] Operations, Phabricator, Traffic, Release-Engineering-Team (Kanban), User-greg: Please create a phame blog for surveys and other type of aggregated analysis - https://phabricator.wikimedia.org/T190598#4077813 (mcruzWMF)
[00:12:37] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:14:37] Operations, Phabricator, Traffic, Release-Engineering-Team (Kanban), User-greg: Please create a phame blog for surveys and other type of aggregated analysis - https://phabricator.wikimedia.org/T190598#4077825 (mcruzWMF) p:Triage>High
[00:16:41] (PS4) Rduran: Create tests skeleton [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/420746
[00:16:43] (PS3) Rduran: [WIP] Refactor and test the main OSC run method [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/421340
[00:19:03] Operations, Phabricator, Traffic, Release-Engineering-Team (Kanban), User-greg: Please create a phame blog for surveys and other type of aggregated analysis - https://phabricator.wikimedia.org/T190598#4077813 (Paladox) I don't think this needs to be tagged as #operations or #traffic. cc @...
[00:20:45] (Abandoned) Rduran: Add flake8 config and requirement [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/420015 (owner: Rduran)
[00:21:39] (PS1) Alexandros Kosiaris: Add wmfdebug image [docker-images/production-images] - https://gerrit.wikimedia.org/r/421680
[00:39:34] !log Correct retention rules for Whisper files on graphite2001 and graphite1001 per T179622 (/var/lib/carbon/whisper/mw/*)
[00:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:42] T179622: Update our Graphite metrics for current retention rules - https://phabricator.wikimedia.org/T179622
[00:55:58] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:27] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:01:28] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:01:37] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:01:37] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:01:38] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:01:38] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:01:47] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:01:47] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:01:48] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:01:48] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:01:57] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:01:58] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:01:58] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:01:58] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:02:07] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:07] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:02:07] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:08] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:02:17] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3048_v4, cp3048_v6
[01:02:17] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:17] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:18] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:18] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:27] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:27] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:28] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:28] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:02:37] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3048_v4, cp3048_v6
[01:02:37] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 63 not-conn: cp3048_v4, cp3048_v6 no-xfrm: cp3044_v4
[01:02:37] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:02:38] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:02:38] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:02:47] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 90 not-conn: cp3048_v4, cp3048_v6
[01:27:03] !log Correct retention rules for Whisper files on graphite2001 and graphite1001 per T179622 (/var/lib/carbon/whisper/VisualEditor/*)
[01:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:09] T179622: Update our Graphite metrics for current retention config - https://phabricator.wikimedia.org/T179622
[01:55:05] Operations, Cloud-Services, DC-Ops, hardware-requests, Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921#4077947 (Krinkle)
[01:55:19] Operations, Cloud-Services, DC-Ops, hardware-requests, Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921#4058041 (Krinkle)
[02:30:38] (CR) Liuxinyu970226: [C: 1] Initial configuration for lfnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/400234 (https://phabricator.wikimedia.org/T183561) (owner: Urbanecm)
[02:31:56] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp3048.esams.wmnet
[02:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:28] (CR) Liuxinyu970226: [C: 1] Initial configuration for inhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374) (owner: Urbanecm)
[02:33:03] !log powercycle cp3048
[02:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:18] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 66 ESP OK
[02:35:27] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK
[02:35:27] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.86 ms
[02:35:27] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK
[02:35:28] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK
[02:35:28] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 92 ESP OK
[02:35:37] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK
[02:35:37] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK
[02:35:37] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK
[02:35:37] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 92 ESP OK
[02:35:38] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK
[02:35:47] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK
[02:35:47] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK
[02:35:47] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK
[02:35:47] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK
[02:35:47] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 66 ESP OK
[02:35:57] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK
[02:35:57] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK
[02:35:57] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 92 ESP OK
[02:35:57] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 92 ESP OK
[02:35:58] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 66 ESP OK
[02:35:58] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 66 ESP OK
[02:35:58] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK
[02:35:59] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 66 ESP OK
[02:35:59] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 92 ESP OK
[02:36:07] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 92 ESP OK
[02:36:07] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 92 ESP OK
[02:36:07] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 92 ESP OK
[02:36:07] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 92 ESP OK
[02:36:08] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 92 ESP OK
[02:36:08] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 66 ESP OK
[02:36:17] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 66 ESP OK
[02:36:18] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK
[02:36:18] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK
[02:49:08] Operations, DNS, Release-Engineering-Team, Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4077990 (Liuxinyu970226)
[02:52:19] Operations, ops-esams, Traffic: cp3048 hardware issues - https://phabricator.wikimedia.org/T190607#4078006 (BBlack) p:Triage>Normal
[03:14:37] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:26:07] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 807.67 seconds
[03:28:05] Critical Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%
[03:28:09] Critical Alert for device cr2-eqiad.wikimedia.org - Primary inbound port utilisation over 80%
[03:28:58] ^ i think that may accidentally be me, transferring data from codfw elasticsearch to eqiad hadoop. It's going back down now
[03:30:45] network on the codfw es cluster at the time was 2.4GBps tx, but i think only half of that was going to eqiad, and the other half was intra-cluster traffic
[03:42:05] Critical Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80%
[03:43:05] Critical Device cr2-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80%
[03:44:37] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[03:55:08] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 255.66 seconds
[05:49:27] PROBLEM - High CPU load on API appserver on mw1289 is CRITICAL: CRITICAL - load average: 91.00, 26.41, 14.96
[05:50:28] PROBLEM - Disk space on Hadoop worker on analytics1051 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/m 25 GB (0% inode=99%): /var/lib/hadoop/data/g 24 GB (0% inode=99%): /var/lib/hadoop/data/l 21 GB (0% inode=99%): /var/lib/hadoop/data/d 24 GB (0% inode=99%): /var/lib/hadoop/data/e 23 GB (0% inode=99%): /var/lib/hadoop/data/c 18 GB (0% inode=99%): /var/lib/hadoop/data/h 25 GB (0% inode=99%): /var/lib/hadoop/data
[05:50:28] 99%): /var/lib/hadoop/data/j 16 GB (0% inode=99%): /var/lib/hadoop/data/b 24 GB (0% inode=99%): /var/lib/hadoop/data/f 22 GB (0% inode=99%): /var/lib/hadoop/data/i 24 GB (0% inode=99%)
[05:51:17] RECOVERY - High CPU load on API appserver on mw1289 is OK: OK - load average: 22.40, 21.19, 14.35
[06:01:27] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 61203 MB (12% inode=99%)
[06:08:08] PROBLEM - Disk space on Hadoop worker on analytics1046 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 20 GB (0% inode=99%): /var/lib/hadoop/data/k 26 GB (0% inode=99%): /var/lib/hadoop/data/f 25 GB (0% inode=99%): /var/lib/hadoop/data/c 16 GB (0% inode=99%): /var/lib/hadoop/data/j 21 GB (0% inode=99%): /var/lib/hadoop/data/g 19 GB (0% inode=99%): /var/lib/hadoop/data/d 19 GB (0% inode=99%): /var/lib/hadoop/data
[06:08:08] 99%): /var/lib/hadoop/data/h 19 GB (0% inode=99%): /var/lib/hadoop/data/b 23 GB (0% inode=99%): /var/lib/hadoop/data/i 24 GB (0% inode=99%): /var/lib/hadoop/data/m 22 GB (0% inode=99%)
[06:10:28] RECOVERY - Disk space on elastic1027 is OK: DISK OK
[06:22:48] PROBLEM - Disk space on Hadoop worker on analytics1050 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 22 GB (0% inode=99%): /var/lib/hadoop/data/e 23 GB (0% inode=99%): /var/lib/hadoop/data/k 22 GB (0% inode=99%): /var/lib/hadoop/data/c 20 GB (0% inode=99%): /var/lib/hadoop/data/g 21 GB (0% inode=99%): /var/lib/hadoop/data/j 22 GB (0% inode=99%): /var/lib/hadoop/data/f 24 GB (0% inode=99%): /var/lib/hadoop/data
[06:22:48] 99%): /var/lib/hadoop/data/d 19 GB (0% inode=99%): /var/lib/hadoop/data/m 18 GB (0% inode=99%): /var/lib/hadoop/data/i 23 GB (0% inode=99%): /var/lib/hadoop/data/b 16 GB (0% inode=99%)
[06:27:48] PROBLEM - Disk space on Hadoop worker on analytics1050 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 22 GB (0% inode=99%): /var/lib/hadoop/data/e 23 GB (0% inode=99%): /var/lib/hadoop/data/k 21 GB (0% inode=99%): /var/lib/hadoop/data/c 19 GB (0% inode=99%): /var/lib/hadoop/data/g 21 GB (0% inode=99%): /var/lib/hadoop/data/j 22 GB (0% inode=99%): /var/lib/hadoop/data/f 25 GB (0% inode=99%): /var/lib/hadoop/data
[06:27:48] 99%): /var/lib/hadoop/data/d 19 GB (0% inode=99%): /var/lib/hadoop/data/m 18 GB (0% inode=99%): /var/lib/hadoop/data/i 22 GB (0% inode=99%): /var/lib/hadoop/data/b 16 GB (0% inode=99%)
[07:12:23] Operations, Analytics-Kanban, Patch-For-Review, User-Elukey, User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#4078152 (Joe)
[07:51:12] (PS8) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[07:59:15] (PS9) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[07:59:31] (CR) jerkins-bot: [V: -1] Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657) (owner: ArielGlenn)
[08:00:57] PROBLEM - Disk space on Hadoop worker on analytics1050 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 22 GB (0% inode=99%): /var/lib/hadoop/data/e 21 GB (0% inode=99%): /var/lib/hadoop/data/k 19 GB (0% inode=99%): /var/lib/hadoop/data/c 18 GB (0% inode=99%): /var/lib/hadoop/data/g 21 GB (0% inode=99%): /var/lib/hadoop/data/j 19 GB (0% inode=99%): /var/lib/hadoop/data/f 23 GB (0% inode=99%): /var/lib/hadoop/data
[08:00:58] 99%): /var/lib/hadoop/data/d 17 GB (0% inode=99%): /var/lib/hadoop/data/m 17 GB (0% inode=99%): /var/lib/hadoop/data/i 20 GB (0% inode=99%): /var/lib/hadoop/data/b 16 GB (0% inode=99%)
[08:03:38] (PS10) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[08:07:51] (PS11) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[08:12:13] (PS12) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[08:12:48] RECOVERY - Disk space on Hadoop worker on analytics1051 is OK: DISK OK
[08:12:58] RECOVERY - Disk space on Hadoop worker on analytics1050 is OK: DISK OK
[08:13:27] RECOVERY - Disk space on Hadoop worker on analytics1046 is OK: DISK OK
[08:16:59] (PS13) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[08:37:37] (PS14) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[09:02:14] (PS15) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[09:14:30] (CR) ArielGlenn: "This is a no-op for ferm rules and rsync config, see https://puppet-compiler.wmflabs.org/compiler03/10644/" [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657) (owner: ArielGlenn)
[09:19:27] (PS16) ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657)
[09:20:09] (PS1) Daimona Eaytoy: Change wording for block durations [mediawiki-config] - https://gerrit.wikimedia.org/r/421691 (https://phabricator.wikimedia.org/T190602)
[09:52:50] (PS2) Daimona Eaytoy: Change wording for AbuseFilter global block durations [mediawiki-config] - https://gerrit.wikimedia.org/r/421691 (https://phabricator.wikimedia.org/T190602)
[14:00:18] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 61574 MB (12% inode=99%)
[14:07:07] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received
[14:08:07] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[14:10:07] PROBLEM - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.168 and port 9042: Connection refused
[14:11:08] PROBLEM - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[14:12:18] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[14:12:57] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:20:18] RECOVERY - Disk space on elastic1025 is OK: DISK OK
[14:32:57] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[14:33:18] RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cassandra-a is active
[14:34:17] RECOVERY - cassandra-a SSL 10.64.48.168:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-a valid until 2018-07-20 15:08:10 +0000 (expires in 118 days)
[14:34:58] RECOVERY - cassandra-a CQL 10.64.48.168:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.168 port 9042
[14:38:19] (PS1) Andrew Bogott: labweb oauth: use labtestwikitech, not prod wikitech [puppet] - https://gerrit.wikimedia.org/r/421702 (https://phabricator.wikimedia.org/T156276)
[14:41:26] (CR) Andrew Bogott: [C: 2] labweb oauth: use labtestwikitech, not prod wikitech [puppet] - https://gerrit.wikimedia.org/r/421702 (https://phabricator.wikimedia.org/T156276) (owner: Andrew Bogott)
[14:59:27] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:00:29] !log rm -rf /srv/mediawiki/core on stat100[456] and force puppet run (git pull returned fatal: protocol error: bad pack header)
[15:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:07] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[15:03:07] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[15:50:27] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 60930 MB (12% inode=99%)
[15:52:27] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 60366 MB (12% inode=99%)
[16:08:27] RECOVERY - Disk space on elastic1025 is OK: DISK OK
[16:32:55] (PS1) Alex Monk: openstack: Permit deployment-prep-dns-manager to log in from instance subnet [puppet] - https://gerrit.wikimedia.org/r/421709 (https://phabricator.wikimedia.org/T182927)
[18:27:29] PROBLEM - nova-compute proc maximum on labvirt1012 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[18:28:28] RECOVERY - nova-compute proc maximum on labvirt1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[19:45:38] (PS1) MarcoAurelio: Add 'tboverride' to 'engineer' at ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/421721 (https://phabricator.wikimedia.org/T190619)
[20:22:15] !log rm 2fa from Awight@officewiki
[20:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:17] PROBLEM - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[20:39:28] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:39:28] PROBLEM - cassandra-b service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[20:39:57] PROBLEM - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is CRITICAL: connect to address 10.64.48.169 and port 9042: Connection refused
[20:49:22] hello 118
[21:04:28] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[21:04:28] RECOVERY - cassandra-b service on restbase-dev1006 is OK: OK - cassandra-b is active
[21:05:17] RECOVERY - cassandra-b SSL 10.64.48.169:7001 on restbase-dev1006 is OK: SSL OK - Certificate restbase-dev1006-b valid until 2018-07-20 15:08:11 +0000 (expires in 117 days)
[21:05:57] RECOVERY - cassandra-b CQL 10.64.48.169:9042 on restbase-dev1006 is OK: TCP OK - 0.000 second response time on 10.64.48.169 port 9042
[21:24:52] Puppet, Beta-Cluster-Infrastructure: PROBLEM - Puppet errors on deployment-mediawiki07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] - https://phabricator.wikimedia.org/T190632#4078788 (MarcoAurelio)
[21:26:51] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#4078802 (MarcoAurelio)
[21:26:55] Puppet, Beta-Cluster-Infrastructure: PROBLEM - Puppet errors on deployment-mediawiki07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] - https://phabricator.wikimedia.org/T190632#4078801 (MarcoAurelio)
[21:27:39] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2192875 (MarcoAurelio)
[21:28:23] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2192875 (MarcoAurelio)
[21:28:25] Operations, Puppet, Beta-Cluster-Infrastructure: Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76) - https://phabricator.wikimedia.org/T187736#4078807 (MarcoAurelio)
[21:38:28] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2192875 (MarcoAurelio) deployment-videoscaler01 seems to no longer exist? ``` $ ssh -a deployment-videoscaler01 channel 0: open failed: connect failed: No route...
[22:01:16] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#4078843 (MarcoAurelio) @Andrew et al. Some docs on Wikitech on usual puppet errors and how to fix them would IMHO help. I feel some of us who have access to de...
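
For readers following the cp3048 incident above (host DOWN at 00:55:58, depool logged at 02:31:56, powercycle at 02:33:03, recoveries from 02:35:18): the depool entry in the SAL is a conftool action. Below is a minimal sketch of what that depool, and an eventual repool, look like with confctl; the hostname and selector are taken from the log, while the repool step is an assumption and does not appear in this excerpt.

```
# Depool cp3048 from its services (matches the conftool action logged at 02:31:56).
confctl select 'name=cp3048.esams.wmnet' set/pooled=no

# ...power-cycle the host and wait for it and its IPsec tunnels to recover...

# Repool once healthy again (assumed follow-up; not shown in this log).
confctl select 'name=cp3048.esams.wmnet' set/pooled=yes
```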
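Similarly, the SAL entries at 00:39:34 and 01:27:03 record retention corrections for existing Whisper files under /var/lib/carbon/whisper/. The log does not show the command that was used; as an illustrative sketch only, existing .wsp files are typically rewritten to match new retention rules with the whisper-resize tool from Graphite's whisper package. The retention values below are placeholders, not the ones from T179622.

```
# Hypothetical sketch: resize every Whisper file under mw/ to new retention archives.
# The archive definitions here are placeholders; the real ones are set in storage-schemas.conf.
find /var/lib/carbon/whisper/mw/ -name '*.wsp' \
  -exec whisper-resize.py --nobackup {} 1m:7d 5m:30d 1h:1y \;
```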