[00:35:21] PROBLEM - puppet last run on db1117 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:52:01] PROBLEM - puppet last run on cloudelastic1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:01:51] RECOVERY - puppet last run on db1117 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:18:29] RECOVERY - puppet last run on cloudelastic1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:50:05] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:16:37] RECOVERY - puppet last run on dubnium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:16:37] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 21283000 and 0 seconds
[02:16:37] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17784848 and 0 seconds
[02:17:57] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 16424 and 46 seconds
[02:17:58] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4144 and 46 seconds
[03:01:33] PROBLEM - puppet last run on db1118 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:33:23] RECOVERY - puppet last run on db1118 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:15:33] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[04:17:49] PROBLEM - puppet last run on mw1223 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:44:17] RECOVERY - puppet last run on mw1223 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:08:43] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=LIST https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[05:11:21] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[05:19:39] !log Clean up some space on webperf2001 - T221508
[05:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:45] T221508: webperf2001 is running ouf of disk space - https://phabricator.wikimedia.org/T221508
[05:20:33] RECOVERY - Disk space on webperf2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[05:22:52] Operations, Performance-Team: webperf2001 is running ouf of disk space - https://phabricator.wikimedia.org/T221508 (Marostegui) The host was fully full: ` root@webperf2001:/var/log# df -hT Filesystem Type Size Used Avail Use% Mounted on udev devtmpfs 3.9G 0 3.9G 0% /dev tmpfs...
[05:26:37] RECOVERY - Check systemd state on webperf2001 is OK: OK - running: The system is fully operational
[05:29:40] Operations, Performance-Team: webperf2001 is running ouf of disk space - https://phabricator.wikimedia.org/T221508 (Marostegui) This will get full in a matter of minutes again: ` root@webperf2001:/var/log# ls -lh messages user.log -rw-r----- 1 root adm 1.3G Apr 21 05:27 messages -rw-r----- 1 root adm 989...
[05:33:27] PROBLEM - puppet last run on aluminium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:45:39] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:05:17] RECOVERY - puppet last run on aluminium is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[06:16:59] PROBLEM - Disk space on webperf2001 is CRITICAL: DISK CRITICAL - free space: / 1519 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[06:28:21] PROBLEM - Check systemd state on webperf2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:32:09] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:56:35] ACKNOWLEDGEMENT - HP RAID on db2037 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:11 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T221512
[06:56:40] Operations, ops-codfw: Degraded RAID on db2037 - https://phabricator.wikimedia.org/T221512 (ops-monitoring-bot)
[06:58:37] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:01:05] Operations, ops-codfw, DBA: Degraded RAID on db2037 - https://phabricator.wikimedia.org/T221512 (Marostegui) p:Triage→Normal a:Papaul Let's get it replaced Thanks!
[07:02:12] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[08:00:25] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 94 probes of 406 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[08:00:45] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 129 probes of 445 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[08:16:19] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 13 probes of 406 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[08:16:39] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 4 probes of 445 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:52:27] (PS1) ArielGlenn: enable use of lbzip2 for revision history dumps for all big wikis [puppet] - https://gerrit.wikimedia.org/r/505441
[10:02:29] Operations, Core Platform Team, DBA, MediaWiki-Database, and 3 others: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (Jc86035) Should T221380 be merged or made a subtask? It seems like the same issue, but I would...
[10:11:19] (CR) Alex Monk: "PS3 fixes port numbers." [software/swift-ring] - https://gerrit.wikimedia.org/r/503714 (owner: Alex Monk)
[12:13:53] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[12:15:15] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[13:12:37] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[13:59:07] PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:18:55] PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:25:33] RECOVERY - puppet last run on elastic1051 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[14:45:25] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:50:01] PROBLEM - puppet last run on logstash1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:07:57] Operations, Performance-Team: webperf2001 is running ouf of disk space - https://phabricator.wikimedia.org/T221508 (Krinkle) There should be only one instance of each on a given webperfx001 instance: * statsv.py * navtiming.py * coal.py It would appear there are multiple in that snapshot. Afaik that's...
[15:17:17] Operations, Performance-Team: webperf2001 is running ouf of disk space - https://phabricator.wikimedia.org/T221508 (Krinkle) There seem to be several major points in time where something significant happened on webperf2001 in the last 7 days. Comparing webperf1001 and webperf2001 which have the same rol...
[15:21:49] RECOVERY - puppet last run on logstash1011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[16:16:55] (CR) Andrew Bogott: [C: +2] cloud dns: move primary services to cloud-ns0 and cloud-ns1 [puppet] - https://gerrit.wikimedia.org/r/504572 (https://phabricator.wikimedia.org/T221183) (owner: Andrew Bogott)
[16:20:02] Operations, DBA, Jade, TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (Marostegui) As far as I remember (it has been a while) all the stuff that was sent for me to review was reviewed and I believe...
[17:12:28] Operations, Core Platform Team, DBA, MediaWiki-Database, and 3 others: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (Jeff_G) A workaround is to start with [[Special:Log/block]] or https://commons.wikimedia.org/w...
[17:30:13] PROBLEM - Check for gridmaster host resolution UDP on cloudservices1003 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:04:15] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:05:47] Hmm
[18:06:03] I wonder why puppet failed on ^^
[18:36:03] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[18:38:56] I think it heard you paladox.
[18:39:10] Lol
[18:41:31] paladox: there have been a lot of spurious 'catalog fetch fail' errors lately
[18:41:48] something unknown happened ~end of March or so that increased the rate of them to several a day
[18:41:57] Ah
[18:42:00] Thanks
[18:44:51] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:49:05] Operations, DBA, Jade, TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (Harej) In your opinion, are there any problems you can foresee from a database design perspective?
[18:55:17] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:55:30] Operations, Traffic, Goal, HTTPS, Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (Krinkle) @Krenair I was referring to this bit: >>! From the **Task description**: > * Prioritize which "junk" doma...
[20:24:53] Operations: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (jijiki)
[20:26:14] Operations: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (jijiki)
[20:28:16] Operations, Puppet, puppet-compiler: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (jijiki)
[20:43:06] Operations, Performance-Team: webperf2001 is running ouf of disk space - https://phabricator.wikimedia.org/T221508 (jijiki) I am afraid I do not know much either about the services on this server so to perform any actions. @Krinkle Is there something we can do for the time being? What problems are we hav...
[21:46:55] PROBLEM - puppet last run on kubestage1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:50:30] Operations, Performance-Team: webperf2001 is running out of disk space - https://phabricator.wikimedia.org/T221508 (Aklapper)
[22:13:21] RECOVERY - puppet last run on kubestage1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[23:10:41] (PS1) Alex Monk: auth pdns: bind on IPv6 [puppet] - https://gerrit.wikimedia.org/r/505477 (https://phabricator.wikimedia.org/T221527)
[23:13:33] (PS2) Alex Monk: auth pdns: bind on IPv6 [puppet] - https://gerrit.wikimedia.org/r/505477 (https://phabricator.wikimedia.org/T221527)
[23:13:40] (CR) Alex Monk: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/505477 (https://phabricator.wikimedia.org/T221527) (owner: Alex Monk)
[23:14:19] (CR) jerkins-bot: [V: -1] auth pdns: bind on IPv6 [puppet] - https://gerrit.wikimedia.org/r/505477 (https://phabricator.wikimedia.org/T221527) (owner: Alex Monk)
[23:14:58] (CR) jerkins-bot: [V: -1] auth pdns: bind on IPv6 [puppet] - https://gerrit.wikimedia.org/r/505477 (https://phabricator.wikimedia.org/T221527) (owner: Alex Monk)
[23:18:16] I wonder what the mysterious blank hostname that failed completely at https://puppet-compiler.wmflabs.org/compiler1002/162/ is
[23:21:04] (PS3) Alex Monk: auth pdns: bind on IPv6 [puppet] - https://gerrit.wikimedia.org/r/505477 (https://phabricator.wikimedia.org/T221527)
[23:43:37] Operations, Puppet, puppet-compiler: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (CDanis) A bit out of date, but {P8336}
[23:44:23] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:57:20] (CR) Faidon Liambotis: [C: -1] coherence report: General improvements and rack checks (2 comments) [software/netbox-reports] - https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: CRusnov)