[00:25:33] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:59] (PS1) Zoranzoki21: flaggedrevs.php: Enable autoreview for bots on bswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/547901
[00:33:17] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:36:11] (PS2) Zoranzoki21: flaggedrevs.php: Enable autoreview for bots on bswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/547901 (https://phabricator.wikimedia.org/T237170)
[00:36:51] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:53] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:56:18] What's up with commons? I'm getting "[Xb4zmwpAAD4AAHQiSJMAAAAI] 2019-11-03 01:55:39: Fatal exception of type "InvalidArgumentException"" when trying to undo an edit
[02:00:07] any edit or just a particular one thcipriani
[02:00:24] @Krenair https://commons.wikimedia.org/w/index.php?title=File:Tara_Sutaria_at_Sabyasachi_event_in_2019.jpg&diff=373018477&oldid=364761963&diffmode=source
[02:00:25] TheSandDoctor* sorry thc.ipriani, hit tab too soon
[02:00:44] * TheSandDoctor was just about to file something on Phabricator.
[02:00:57] (CR) Andrew Bogott: [C: +2] openstack: Update comments on pdns3hack [puppet] - https://gerrit.wikimedia.org/r/547874 (owner: Alex Monk)
[02:02:33] @Krenair https://phabricator.wikimedia.org/T237173
[02:02:37] I subscribed you to it
[02:03:30] TheSandDoctor, ack, managed to undo this one manually
[02:03:41] you got same thing>
[02:03:43] ?*
[02:04:11] yes
[02:04:19] might want to add that to the ticket?
[02:04:25] might be relating to MCR, not sure yet
[02:05:54] MCR @Krenair?
[02:10:09] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:07] TheSandDoctor, Multi-Content Revisions, being used for SDC
[02:11:29] this is where I sound stupid and ask what SDC stands for....
[02:11:32] :P
[02:11:35] @Krenair
[02:12:27] https://meta.wikimedia.org/wiki/Structured_Data_on_Commons
[02:13:52] thanks
[02:37:35] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:03:18] !log andrew@deploy1001 Started deploy [horizon/deploy@9972ed2]: deploying fix for puppet prefix creation
[03:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:09:16] !log andrew@deploy1001 Finished deploy [horizon/deploy@9972ed2]: deploying fix for puppet prefix creation (duration: 06m 01s)
[03:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:10:01] !log andrew@deploy1001 Started deploy [horizon/deploy@9972ed2]: deploying fix for puppet prefix creation (second try)
[03:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:10:25] !log andrew@deploy1001 Finished deploy [horizon/deploy@9972ed2]: deploying fix for puppet prefix creation (second try) (duration: 00m 25s)
[03:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:13:54] Operations, ops-codfw, Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (Andrew) I can't PXE boot, so something is broken somewhere. I haven't dug in much though. ` Broadcom UNDI PXE-2.1 v214.0.170.0 Copyright...
[03:50:32] !log andrew@deploy1001 Started deploy [horizon/deploy@0c024d4]: one more prefix fix
[03:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:54:06] !log andrew@deploy1001 Finished deploy [horizon/deploy@0c024d4]: one more prefix fix (duration: 03m 35s)
[03:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:32:23] PROBLEM - Host ms-be2056 is DOWN: PING CRITICAL - Packet loss = 100%
[04:34:07] RECOVERY - Host ms-be2056 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms
[05:00:11] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[05:01:47] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1696 days) https://wikitech.wikimedia.org/wiki/Logs
[06:40:33] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:06:19] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:13:26] (CR) Urbanecm: [C: -1] flaggedrevs.php: Enable autoreview for bots on bswiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/547901 (https://phabricator.wikimedia.org/T237170) (owner: Zoranzoki21)
[07:50:38] (CR) Masumrezarock100: Add localized Minerva wordmark for Sindhi Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/547061 (https://phabricator.wikimedia.org/T200870) (owner: Ammarpad)
[08:55:25] PROBLEM - Long running screen/tmux on snapshot1005 is CRITICAL: CRIT: Long running SCREEN process. (user: ariel PID: 7930, 1733099s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[09:28:05] (PS1) Brian Wolff: Adjust CSP header for pdfs & videos & set enforce on testwiki [puppet] - https://gerrit.wikimedia.org/r/547929 (https://phabricator.wikimedia.org/T117618)
[09:28:53] (CR) Ema: [C: +1] cumin: aliases: cache::text_ats is a thing now [puppet] - https://gerrit.wikimedia.org/r/547800 (https://phabricator.wikimedia.org/T227432) (owner: CDanis)
[09:46:31] Operations, Security-Team, Traffic, Wikimedia-General-or-Unknown, Patch-For-Review: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618 (Bawolff) A small number of browsers seem to want android-webview-video-poster: as a source when viewing videos, but the...
[10:12:19] Operations, ContentSecurityPolicy, Security-Team, Traffic, and 2 others: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618 (Bawolff)
[10:28:21] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [mostread] https://wikitech.wikimedia.org/wiki/RESTBase
[10:29:53] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[12:39:54] (PS1) MarcoAurelio: Allow FlaggedRevs' 'autoreview' permission to be assigned globally [mediawiki-config] - https://gerrit.wikimedia.org/r/547957
[12:41:06] (PS2) MarcoAurelio: Allow FlaggedRevs' 'autoreview' permission to be assigned globally [mediawiki-config] - https://gerrit.wikimedia.org/r/547957
[12:45:29] Urbanecm: can you rebase https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMessages/+/529037/ please?
[12:45:56] * Urbanecm tried the web button
[12:45:59] sure, will do hauskater :)
[12:46:19] Yeah, I tried that too lol
[12:46:41] * Urbanecm should've guessed you did :D
[12:47:13] (CR) Urbanecm: [C: +1] "Sure" [mediawiki-config] - https://gerrit.wikimedia.org/r/547957 (owner: MarcoAurelio)
[12:48:07] hauskater: should be done!
[12:50:27] Urbanecm: thanks, but it looks "group-oathauth-tester-member": "two-factor authentication tester", is added there for some reason?
[12:51:05] * Urbanecm was not paying enough attention
[12:51:49] hauskater: what about now?
[12:52:07] checking
[12:52:31] still there?
[12:52:45] ehm wait
[12:52:48] reloading
[12:53:21] Urbanecm: looks good
[12:53:24] thanks
[12:53:36] hauskater: yw
[13:10:37] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:54:45] (Abandoned) Zoranzoki21: flaggedrevs.php: Enable autoreview for bots on bswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/547901 (https://phabricator.wikimedia.org/T237170) (owner: Zoranzoki21)
[14:06:45] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:26:25] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[15:00:37] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[15:10:19] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[15:13:31] ACKNOWLEDGEMENT - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project andrew bogott investigating https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[15:26:33] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[15:26:44] Operations, serviceops: Kubernetes hosts raid check make facter fail - https://phabricator.wikimedia.org/T237197 (Volans)
[15:35:25] Operations, serviceops: Kubernetes workers frequent oom-killer in action - https://phabricator.wikimedia.org/T237198 (Volans)
[15:40:07] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:07:31] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:53:52] davidwbarratt: hi. re. react.i18n if you submit https://gerrit.wikimedia.org/r/#/c/react.i18n/+/547894/ jenkins will be able to submit on the repo.
[19:12:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:17:09] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:40:51] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:06:33] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:40:17] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:46:53] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:51:31] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:57:29] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:11:09] Operations, ops-codfw, Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (Papaul) @Andrew the reason is that cloudbackup2002 is in the .16 network or it supposed to be in the .32 network since it is racked in row...
[23:57:20] (PS1) Alex Monk: cloud-puppetmaster: Prep for new instances [puppet] - https://gerrit.wikimedia.org/r/547992 (https://phabricator.wikimedia.org/T235218)