[00:09:21] PROBLEM - Check systemd state on ms-be1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:10:39] RECOVERY - Check systemd state on ms-be1014 is OK: OK - running: The system is fully operational
[00:11:15] !log reset passwords for FritzSolms@global and Seanhood@global
[00:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:29] foks: looks like it works ?:)
[00:11:40] this was on SUL :D
[00:11:44] ah
[00:11:46] the issue was with labs
[00:11:54] I can try that now with myself maybe?
[00:12:04] Unless there is a test account for this stuff
[00:12:21] (PS1) Bstorm: cloudstore: introduce rsync framework for secondary cluster [puppet] - https://gerrit.wikimedia.org/r/506847 (https://phabricator.wikimedia.org/T209527)
[00:13:30] foks: there are 28 users starting with test* but i don't have any of their credentials
[00:13:42] I might just make a new account for it
[00:13:49] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:13:57] You can't create accounts on wikitech atm
[00:14:07] admins can
[00:14:09] Needs an admin to do it
[00:14:19] oh yeah, duh
[00:14:35] why is everything so hard lol
[00:14:49] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[00:18:43] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[00:19:03] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:27:31] RECOVERY - Check systemd state on cloudstore1009 is OK: OK - running: The system is fully operational
[00:32:22] (PS1) Dzahn: transparency report: allow members of LDAP 'nda' to see private site [puppet] - https://gerrit.wikimedia.org/r/506848 (https://phabricator.wikimedia.org/T221744)
[01:07:01] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[01:07:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:08:11] (CR) Alex Monk: Allow puppet-merge to merge the labs/private repo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/506582 (https://phabricator.wikimedia.org/T221888) (owner: Andrew Bogott)
[01:18:43] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[01:19:03] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:34:31] Puppet, Cloud-Services, cloud-services-team (Kanban): Consider ways to make puppetmaster CA changes smoother on the puppet client end - https://phabricator.wikimedia.org/T220268 (Krenair) >>! In T220268#5123333, @Andrew wrote: > I'm wary of having a central repo of alternate puppetmasters (mostly bec...
[01:55:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:56:39] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[02:00:31] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[02:00:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:06:05] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:08:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[03:32:22] (PS1) Jforrester: Provide a temporary trwiki logo marking two years of censorship [mediawiki-config] - https://gerrit.wikimedia.org/r/506849
[03:48:29] (PS3) Tulsi Bhagat: Add namespace "Aldono" at eo.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/505765 (https://phabricator.wikimedia.org/T221525)
[04:01:30] (CR) Tulsi Bhagat: "Requires `mwmaintenancescript namespaceDupes.php --wiki=eowiktionary --fix` after deployment."
[mediawiki-config] - https://gerrit.wikimedia.org/r/505765 (https://phabricator.wikimedia.org/T221525) (owner: Tulsi Bhagat)
[04:51:47] PROBLEM - puppet last run on centrallog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:52:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[04:54:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[04:54:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[04:56:55] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:02:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:02:47] PROBLEM - Check systemd state on ms-be1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:03:19] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:10:39] RECOVERY - Check systemd state on ms-be1028 is OK: OK - running: The system is fully operational
[05:18:15] RECOVERY - puppet last run on centrallog1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:28:33] PROBLEM - puppet last run on cloudvirt1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-intel-microcode]
[06:31:11] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[06:31:25] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:31:31] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:32:11] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:32:51] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf]
[06:33:21] RECOVERY - HP RAID on ms-be1034 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:33:41] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[06:34:17] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[06:49:56] (PS1) Rxy: Allow admins to add or remove patroller group at enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/506860 (https://phabricator.wikimedia.org/T222008)
[06:52:03] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[06:53:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:54:48] (CR) DannyS712: [C: +1] "Looks good, though it may make sense to also remove the add/remove ability from bureaucrats."
[mediawiki-config] - https://gerrit.wikimedia.org/r/506860 (https://phabricator.wikimedia.org/T222008) (owner: Rxy)
[06:56:25] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[06:57:05] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[06:57:59] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:41] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:17] RECOVERY - puppet last run on cp1080 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:17] RECOVERY - puppet last run on cloudvirt1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:08:15] PROBLEM - Check systemd state on ms-be1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:10:51] RECOVERY - Check systemd state on ms-be1014 is OK: OK - running: The system is fully operational
[08:07:57] PROBLEM - Check systemd state on ms-be1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:09:46] (PS1) Ema: upload VCL: stop thumb requests with bad file extension [puppet] - https://gerrit.wikimedia.org/r/506862
[08:10:33] RECOVERY - Check systemd state on ms-be1014 is OK: OK - running: The system is fully operational
[08:24:13] PROBLEM - Check systemd state on ms-be1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:51:37] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[08:51:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:53:55] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[08:55:53] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[08:57:59] RECOVERY - Check systemd state on ms-be1015 is OK: OK - running: The system is fully operational
[09:07:14] (CR) Elukey: [C: +1] "Limited understanding of the problem but from https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/495643/ it looks correct!" [puppet] - https://gerrit.wikimedia.org/r/506862 (owner: Ema)
[09:08:02] (CR) Urbanecm: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/505765 (https://phabricator.wikimedia.org/T221525) (owner: Tulsi Bhagat)
[09:08:31] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[09:10:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:12:36] (CR) Ema: [C: +2] upload VCL: stop thumb requests with bad file extension [puppet] - https://gerrit.wikimedia.org/r/506862 (owner: Ema)
[09:14:09] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[09:14:45] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[11:01:29] Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey) A reason why the SRE team is very strict in what Docker images...
[11:44:23] PROBLEM - puppet last run on kubernetes1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:05:25] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[12:16:11] RECOVERY - puppet last run on kubernetes1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[12:31:55] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:37:12] !log stopping dbstore2002:s6 to clone it to db2097 T220572
[12:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:19] T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572
[12:37:23] PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:37:46] !log correcting last log, stopping dbstore2002:s1 to clone it to db2097 T220572
[12:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:37] PROBLEM - puppet last run on puppetmaster1002 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[13:03:49] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[13:12:58] (PS1) Ema: varnish: run VTC tests against remote PCC [puppet] - https://gerrit.wikimedia.org/r/506868 (https://phabricator.wikimedia.org/T128188)
[13:13:35] (CR) jerkins-bot: [V: -1] varnish: run VTC tests against remote PCC [puppet] - https://gerrit.wikimedia.org/r/506868 (https://phabricator.wikimedia.org/T128188) (owner: Ema)
[13:24:23] RECOVERY - puppet last run on puppetmaster1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[13:31:09] (PS2) Ema: varnish: run VTC tests against remote PCC [puppet] - https://gerrit.wikimedia.org/r/506868 (https://phabricator.wikimedia.org/T128188)
[13:31:39] (CR) jerkins-bot: [V: -1] varnish: run VTC tests against remote PCC [puppet] - https://gerrit.wikimedia.org/r/506868 (https://phabricator.wikimedia.org/T128188) (owner: Ema)
[13:38:52] (PS3) Ema: varnish: run VTC tests against remote PCC [puppet] - https://gerrit.wikimedia.org/r/506868 (https://phabricator.wikimedia.org/T128188)
[13:55:53] PROBLEM - Check systemd state on ms-be1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:57:09] RECOVERY - Check systemd state on ms-be1015 is OK: OK - running: The system is fully operational
[14:45:37] PROBLEM - puppet last run on db1093 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[15:02:59] (PS4) Ema: varnish: run VTC tests against remote PCC [puppet] - https://gerrit.wikimedia.org/r/506868 (https://phabricator.wikimedia.org/T128188)
[15:09:30] (PS5) Ema: varnish: run VTC tests against remote PCC [puppet] - https://gerrit.wikimedia.org/r/506868 (https://phabricator.wikimedia.org/T128188)
[15:12:03] RECOVERY - puppet last run on db1093 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[15:59:55] (PS1) Jcrespo: mariadb: Reenable notifications for db2097 [puppet] - https://gerrit.wikimedia.org/r/506871 (https://phabricator.wikimedia.org/T220572)
[16:16:24] (PS1) Alex Monk: Puppet CAs: Make it easy to swap CAs by hiera change [puppet] - https://gerrit.wikimedia.org/r/506872 (https://phabricator.wikimedia.org/T220268)
[16:16:27] (PS1) Alex Monk: Puppet certs: Move old client certs away when Puppet CA changes [puppet] - https://gerrit.wikimedia.org/r/506873 (https://phabricator.wikimedia.org/T220268)
[16:17:15] (CR) jerkins-bot: [V: -1] Puppet CAs: Make it easy to swap CAs by hiera change [puppet] - https://gerrit.wikimedia.org/r/506872 (https://phabricator.wikimedia.org/T220268) (owner: Alex Monk)
[16:17:22] (CR) jerkins-bot: [V: -1] Puppet certs: Move old client certs away when Puppet CA changes [puppet] - https://gerrit.wikimedia.org/r/506873 (https://phabricator.wikimedia.org/T220268) (owner: Alex Monk)
[16:19:16] (PS2) Alex Monk: Puppet CAs: Make it easy to swap CAs by hiera change [puppet] - https://gerrit.wikimedia.org/r/506872 (https://phabricator.wikimedia.org/T220268)
[16:20:13] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[16:21:03] (PS2) Alex Monk: Puppet certs: Move old client certs away when Puppet CA changes [puppet] - https://gerrit.wikimedia.org/r/506873 (https://phabricator.wikimedia.org/T220268)
[16:21:23] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.379 second response time https://phabricator.wikimedia.org/T174916
[16:29:21] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[16:33:11] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.781 second response time https://phabricator.wikimedia.org/T174916
[16:37:09] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[17:20:35] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:43:55] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.486 second response time https://phabricator.wikimedia.org/T174916
[17:44:50] !log restart pdfrender on scb1002 (alert flapping)
[17:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:45] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational
[18:24:49] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[18:27:27] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[19:11:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[19:14:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the
threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[21:03:01] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[21:13:25] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[23:04:35] PROBLEM - puppet last run on db1121 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:31:01] RECOVERY - puppet last run on db1121 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:41:49] Operations, Patch-For-Review, cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (Andrew)