[00:05:53] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:09:23] Operations, Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Cassandra Operational ), Services (watching): restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (Dzahn) @eevans Alright, despite the issues above the server has been reinstalled now and is on st... [00:09:42] Operations, Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Cassandra Operational ), Services (watching): restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (Dzahn) a:Cmjohnson→Eevans cc: @fgiunchedi [00:10:37] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [00:13:02] (Abandoned) Huji: Dissallow eliminators to block certain groups on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/476503 (https://phabricator.wikimedia.org/T210642) (owner: Huji) [00:13:45] (PS5) Huji: Add several rights to eliminators in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/430627 (https://phabricator.wikimedia.org/T176553) [00:14:09] (CR) Huji: "Yes. It has been sitting here for a long time despite community consensus." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/430627 (https://phabricator.wikimedia.org/T176553) (owner: Huji) [00:22:34] PROBLEM - MD RAID on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:28:32] PROBLEM - Check size of conntrack table on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [00:28:32] PROBLEM - configured eth on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [00:30:36] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:36] PROBLEM - dhclient process on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [00:31:19] ^ ehm... i reinstalled that earlier. 
looking [00:32:24] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:32:34] PROBLEM - Check the NTP synchronisation status of timesyncd on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [00:32:34] PROBLEM - puppet last run on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:33:27] !log wikitech-static - fix /etc/letsencrypt/renewal/wikitech-static.wikimedia.org.conf - remove webroot_map and the line for status.wm.org that caused errors when doing a renewal dry-run. now dry run finishes successfully and we are using "webroot" authenticator and not "apache" anymore. This should have resolved what this ticket was about. No more Apache kills/restarts on renewal. 
(T214640) [00:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:37] T214640: wikitech-static cert renewal seems to stop apache2 - https://phabricator.wikimedia.org/T214640 [00:34:34] PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:34:34] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:35:48] !log restbase-dev1006 - starting nagios-nrpe-server [00:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:34] PROBLEM - DPKG on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [00:36:34] ACKNOWLEDGEMENT - puppet last run on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T224260 https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:36:34] ACKNOWLEDGEMENT - restbase endpoints health on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T224260 https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:39:16] PROBLEM - IPMI Sensor Status on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [00:39:20] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [00:40:24] PROBLEM - Disk space on restbase-dev1006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.10: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1006&var-datasource=eqiad+prometheus/ops [00:42:21] weird, puppet mysteriously works now but instead we have these :p [00:50:07] Operations, Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Cassandra Operational ), Services (watching): restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (Dzahn) actually.. puppet run is not failing anymore now. :) though.. i had to restart nagios-n... [01:00:54] (CR) Eevans: table-properties: Initial commit (2 comments) [software/cassandra-table-properties] - https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: Holger Knust) [01:04:54] mutante: puppet runs without error? [01:05:35] mutante: this machine is cursed [01:05:49] we should cleanse it with fire [01:06:04] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 52.44, 23.52, 12.94 https://wikitech.wikimedia.org/wiki/Application_servers [01:07:37] "Filesystem available is greater than filesystem size" [01:07:41] nice. 
[01:07:42] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 25.13, 23.44, 14.04 https://wikitech.wikimedia.org/wiki/Application_servers [01:07:56] PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 76.14, 34.75, 19.29 https://wikitech.wikimedia.org/wiki/Application_servers [01:08:28] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [01:08:28] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:08:36] PROBLEM - Apache HTTP on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [01:08:40] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [01:08:40] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a 
response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:08:44] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [01:08:44] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedi [01:08:44] es/Monitoring/recommendation_api [01:08:52] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:08:54] PROBLEM - Nginx local proxy to apache on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [01:09:00] PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [01:09:02] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good 
article title) timed out before a response was rece [01:09:02] tech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:09:02] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:09:02] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:09:08] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:09:08] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:09:18] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:09:32] PROBLEM - HHVM rendering on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [01:09:34] RECOVERY - High CPU load 
on API appserver on mw1276 is OK: OK - load average: 25.63, 29.97, 19.24 https://wikitech.wikimedia.org/wiki/Application_servers [01:10:02] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:10:06] RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 1.327 second response time https://wikitech.wikimedia.org/wiki/Application_servers [01:10:18] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:10:22] RECOVERY - Nginx local proxy to apache on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [01:10:24] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:10:28] RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [01:10:30] PROBLEM - High CPU load on API appserver on mw1340 is CRITICAL: CRITICAL - load average: 73.06, 38.94, 23.47 https://wikitech.wikimedia.org/wiki/Application_servers [01:10:34] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:10:34] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:10:34] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:10:40] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:10:42] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:10:52] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:11:00] RECOVERY - HHVM rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 76371 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Application_servers [01:11:00] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 56.93, 29.03, 16.28 https://wikitech.wikimedia.org/wiki/Application_servers [01:11:02] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 61.71, 31.34, 18.03 https://wikitech.wikimedia.org/wiki/Application_servers [01:11:02] PROBLEM - puppet last run on ms-fe1007 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:11:24] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 73.02, 39.14, 21.23 https://wikitech.wikimedia.org/wiki/Application_servers [01:11:32] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 71.98, 36.73, 19.49 https://wikitech.wikimedia.org/wiki/Application_servers [01:11:42] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 50.63, 29.31, 16.81 https://wikitech.wikimedia.org/wiki/Application_servers [01:11:48] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:11:48] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:11:54] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:12:20] (CR) Eevans: table-properties: Initial commit (1 comment) [software/cassandra-table-properties] - https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: Holger Knust) [01:13:18] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 19.44, 24.53, 16.33 https://wikitech.wikimedia.org/wiki/Application_servers [01:14:14] RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 12.77, 23.76, 16.99 https://wikitech.wikimedia.org/wiki/Application_servers [01:14:16] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 13.66, 24.15, 17.94 https://wikitech.wikimedia.org/wiki/Application_servers [01:15:00] PROBLEM - puppet last run on ganeti1002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. 
Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:15:18] RECOVERY - High CPU load on API appserver on mw1340 is OK: OK - load average: 17.72, 35.87, 27.61 https://wikitech.wikimedia.org/wiki/Application_servers [01:16:22] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 8.30, 20.88, 17.60 https://wikitech.wikimedia.org/wiki/Application_servers [01:19:28] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 8.32, 21.63, 22.08 https://wikitech.wikimedia.org/wiki/Application_servers [01:39:08] RECOVERY - puppet last run on ms-fe1007 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:43:02] Operations, Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Cassandra Operational ), Services (watching): restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (Eevans) >>! In T224260#5370545, @Dzahn wrote: > @eevans Alright, despite the issues above the ser... [01:43:04] RECOVERY - puppet last run on ganeti1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:50:12] RECOVERY - High lag on wdqs1009 is OK: (C)3600 ge (W)1200 ge 1011 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [01:50:14] PROBLEM - puppet last run on wtp1033 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:50:28] PROBLEM - puppet last run on dbproxy1013 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. 
Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:10:44] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [02:12:20] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [02:16:44] PROBLEM - Host mw1300 is DOWN: PING CRITICAL - Packet loss = 100% [02:18:14] RECOVERY - puppet last run on wtp1033 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:18:26] RECOVERY - puppet last run on dbproxy1013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:38:05] Operations, Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Cassandra Operational ), Services (watching): restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (Eevans) >>! In T224260#5370545, @Dzahn wrote: > @eevans Alright, despite the issues above the ser... 
[05:42:10] PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100% [05:53:44] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 30.54 ms [05:55:30] PROBLEM - Disk space on actinium is CRITICAL: DISK CRITICAL - free space: / 340 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=actinium&var-datasource=eqiad+prometheus/ops [05:56:48] (PS4) Jeena Huneidi: Add Parsoid chart [deployment-charts] - https://gerrit.wikimedia.org/r/525481 (https://phabricator.wikimedia.org/T228909) [06:01:30] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [06:01:52] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [06:11:45] (CR) Marostegui: Use GTIDs for master position queries for external DB when possible (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/525147 (owner: Aaron Schulz) [06:21:48] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 64.22, 28.53, 15.35 https://wikitech.wikimedia.org/wiki/Application_servers [06:22:10] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 67.96, 29.78, 15.87 https://wikitech.wikimedia.org/wiki/Application_servers [06:23:22] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 80.93, 37.42, 19.16 https://wikitech.wikimedia.org/wiki/Application_servers [06:24:10] RECOVERY - Mediawiki Cirrussearch 
update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [06:24:40] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 57.25, 30.81, 17.24 https://wikitech.wikimedia.org/wiki/Application_servers [06:26:08] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [06:26:16] RECOVERY - Disk space on actinium is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=actinium&var-datasource=eqiad+prometheus/ops [06:29:26] PROBLEM - puppet last run on mc2035 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:33:02] PROBLEM - puppet last run on ms-be1049 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_timedatectl] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:35:16] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 75.53, 45.19, 27.25 https://wikitech.wikimedia.org/wiki/Application_servers [06:35:20] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 70.96, 43.01, 27.05 https://wikitech.wikimedia.org/wiki/Application_servers [06:35:30] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 50.40, 35.79, 23.02 https://wikitech.wikimedia.org/wiki/Application_servers [06:37:26] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:38:58] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:40:16] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 10.95, 24.91, 22.23 https://wikitech.wikimedia.org/wiki/Application_servers [06:43:41] !log powercycle mw1300 - no ssh, serial com2 stuck with no root login available [06:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:54] RECOVERY - Host mw1300 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [06:48:00] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 7.16, 14.43, 22.62 https://wikitech.wikimedia.org/wiki/Application_servers [06:49:32] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 6.84, 13.36, 22.92 https://wikitech.wikimedia.org/wiki/Application_servers [06:55:14] (CR) Aaron Schulz: "Just trying to simplify the config and reduce repetition." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/525147 (owner: Aaron Schulz) [06:55:22] RECOVERY - puppet last run on ms-be1049 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:56:32] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 6.91, 9.11, 22.74 https://wikitech.wikimedia.org/wiki/Application_servers [06:56:50] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 6.94, 8.74, 22.28 https://wikitech.wikimedia.org/wiki/Application_servers [06:57:24] RECOVERY - puppet last run on mc2035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:58:50] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 7.40, 9.06, 23.94 https://wikitech.wikimedia.org/wiki/Application_servers [07:00:06] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 7.29, 8.06, 22.49 https://wikitech.wikimedia.org/wiki/Application_servers [07:11:38] PROBLEM - MegaRAID on cloudvirt1018 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:11:39] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1018 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T229156 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:11:42] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T229156 (ops-monitoring-bot) [07:12:14] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - enwiki_content_1546970425[4](2019-07-23T19:26:29.990Z), dewiki_content_1521846803[4](2019-07-23T19:26:40.352Z), eswiki_content_1521891951[6](2019-07-23T19:26:29.987Z) https://wikitech.wikimedia.org/wiki/Search%23Administration 
[07:27:35] Operations, ops-eqiad, cloud-services-team: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T229156 (Peachey88) [08:02:55] Operations, Analytics, SRE-Access-Requests: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (Peachey88) @Nuria: It was worked out on IRC that they probably need their Hue account created, since they already have NDA LDAP access, see: https://wikitech.wikimedia.org/wiki/Analyt... [10:22:20] RECOVERY - MegaRAID on cloudvirt1018 is OK: OK: optimal, 1 logical, 8 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:21:00] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [11:32:14] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:33:16] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_syslog.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [12:36:30] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_syslog.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [12:44:30] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_syslog.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog 
https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [12:47:42] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_syslog.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [12:52:32] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_syslog.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [13:40:38] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_syslog.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [14:06:26] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_syslog.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [14:24:04] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:24:24] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:41:06] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps/RunBook 
[14:41:08] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [14:41:40] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:42:02] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:42:36] RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:42:42] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:43:24] PROBLEM - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:45:02] RECOVERY - OSPF status on cr1-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:47:12] Warning Alert for device cr2-eqsin.wikimedia.org - Traffic on tunnel link [14:49:52] !log bounce rsyslog on wezen / centrallog1001 [14:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:58] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [14:51:22] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL:
action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_syslog.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [14:53:38] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 756 days) https://wikitech.wikimedia.org/wiki/Logs [14:57:12] ̶W̶a̶r̶n̶i̶n̶g Device cr2-eqsin.wikimedia.org recovered from Traffic on tunnel link [14:59:20] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [14:59:20] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [16:03:58] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 53.66, 30.32, 24.27 https://wikitech.wikimedia.org/wiki/Application_servers [16:23:12] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 14.80, 16.54, 23.72 https://wikitech.wikimedia.org/wiki/Application_servers [16:35:19] Operations, Analytics, SRE-Access-Requests: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (Tnegrin) Approved [17:39:07] !log Updated profile & images for @wikimediatech twitter account [17:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:37] bd808, lol nice [17:44:53] the old profile was boring :) [19:03:40] (PS1) Aaron Schulz: [DNM] Use DBO_DEFAULT for extension1 since it is not for key/value blob storage [mediawiki-config] - https://gerrit.wikimedia.org/r/525977 [19:09:10] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:14:02] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:14] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:36:44] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:39:06] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:56] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:01] Operations, media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (Urbanecm) Open→Resolved The issue that was present all the time was resolved, the last file I wasn't able t... [21:25:32] Operations, Wikimedia-General-or-Unknown, Security: Massive spambot registrations at dinwiki - https://phabricator.wikimedia.org/T212519 (Aklapper) @sbassett: Is there anything actionable left in this task?
[22:27:10] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (Andrew) Resolved→Open a:Andrew→wiki_willy I put this system under a realistic load today (running ~80 VMs)...