[00:06:30] Heads-up: Doing a Friday deploy for T224319 UBN. :-( [00:06:33] T224319: [regression] Editor is read-only after recovering autosaved changes or switching VE to NWE - https://phabricator.wikimedia.org/T224319 [00:13:23] That code is so much fun. [00:26:15] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [00:29:42] Krenair: Indeed. :-( [00:30:20] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.6/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.ArticleTarget.js: Hot-deploy T224319 for VisualEditor switching and auto-restore (duration: 00m 50s) [00:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:26] T224319: [regression] Editor is read-only after recovering autosaved changes or switching VE to NWE - https://phabricator.wikimedia.org/T224319 [00:33:15] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:39:49] (03CR) 10Jforrester: Even more invariant config moved over to CommonSettings (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 (owner: 10Jforrester) [00:42:25] (03CR) 10Jforrester: "> Yeah, I assumed the Zero sunset was much further along. Let's put this on ice for now." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/512442 (https://phabricator.wikimedia.org/T187716) (owner: 10Mholloway) [00:53:09] PROBLEM - pdfrender on scb2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [01:03:01] RECOVERY - pdfrender on scb2006 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 8.135 second response time https://phabricator.wikimedia.org/T174916 [01:18:35] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [01:18:51] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [01:19:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:19:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:19:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:19:49] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:21:03] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [01:21:19] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [01:21:23] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [01:21:25] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [01:22:39] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:22:49] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [01:23:05] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [01:23:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:23:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:23:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:24:02] (03PS4) 10Jforrester: De-duplicate …Squid variables now MW only uses the …Cdn ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496850 (https://phabricator.wikimedia.org/T104148) [01:24:04] (03PS1) 10Jforrester: Wikibase: Add forwards-compatibility for dataCdnMaxAge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512459 [01:24:06] (03PS1) 10Jforrester: Wikibase: Drop backwards-compatibility for dataSquidMaxage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512460 [01:25:01] (03PS2) 10Jforrester: Wikibase: Drop backwards-compatibility for dataSquidMaxage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512460 [01:25:19] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [01:25:22] (03CR) 10Jforrester: [C: 04-2] "Not until wmf.7 is everywhere and we're not going back." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/496850 (https://phabricator.wikimedia.org/T104148) (owner: 10Jforrester) [01:27:01] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [01:28:05] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [01:28:29] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [01:29:47] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [01:33:49] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:26:03] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [02:33:05] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[02:57:03] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:04:05] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:09:47] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:15:27] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:21:07] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:22:33] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:25:17] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [03:33:43] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[03:53:21] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 142.4 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [04:13:43] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:15:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:29:19] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:31:25] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:32:09] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:32:49] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:39:07] 10Operations, 10Dumps-Generation: Reboot dumps/snapshot hosts - https://phabricator.wikimedia.org/T223962 (10ArielGlenn) [04:42:43] 10Operations, 10SRE-Access-Requests, 10observability: Requesting access to icinga for tonycepo - https://phabricator.wikimedia.org/T224313 (10Dzahn) [04:53:59] (03PS1) 10ArielGlenn: add new namespaces param only once for abstractsdump [dumps] - 10https://gerrit.wikimedia.org/r/512463 [04:55:01] (03CR) 10ArielGlenn: [C: 03+2] add new namespaces param only once for abstractsdump [dumps] - 10https://gerrit.wikimedia.org/r/512463 (owner: 10ArielGlenn) [04:56:05] !log ariel@deploy1001 Started deploy [dumps/dumps@61114e0]: add namespaces param only once for abstracts with lang variants [04:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:13] !log ariel@deploy1001 
Finished deploy [dumps/dumps@61114e0]: add namespaces param only once for abstracts with lang variants (duration: 00m 07s) [04:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:13] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Nuria) Approved, thanks. [05:44:08] (03CR) 10Nuria: [C: 03+1] admins: remove expired contractor account of juliaglen (merge on May 31) [puppet] - 10https://gerrit.wikimedia.org/r/512404 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn) [05:45:21] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (10Nuria) Logstash and hadoop are two different systems , do you need access to both? [06:13:31] PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [06:21:33] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 106, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:21:51] PROBLEM - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [06:29:21] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [06:31:07] PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:38:11] 04Critical Alert for device cr1-codfw.wikimedia.org - Juniper alarm active [06:45:55] RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [06:56:21] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:58:11] RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:01:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [08:01:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:08:41] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [08:08:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:09:56] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712, 10Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (10DannyS712) @CDanis just got another one trying to save on meta wiki: ... 
`via cp1081 cp1081, Varnish XID 295561815 Error:... [08:40:21] 10Operations, 10SRE-Access-Requests, 10observability: Requesting access to icinga for tonycepo - https://phabricator.wikimedia.org/T224313 (10Aklapper) @Tonycepo: That is still too vague... Which specific data would you like to access, for which specific actions? If you only generally "want to know more abou... [08:55:08] (03CR) 10Framawiki: [C: 03+1] Test spaces in wgMetaNamespace(Talk) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 (https://phabricator.wikimedia.org/T223965) (owner: 10Urbanecm) [09:20:31] !log decommission restbase1011-b - T223976 [09:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:38] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 [11:39:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:39:42] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.0 503 Service Unavailable - 212 bytes in 13.473 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:41:43] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers thumbor1003.eqiad.wmnet, thumbor1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:42:59] taking a look [11:43:25] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:44:31] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:47:33] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 
11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [11:48:06] <_joe_> !log restarted thumbor-instances on thumbor1001 [11:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:23] (03PS1) 10Alexandros Kosiaris: Temporarily rate limit specific IP [puppet] - 10https://gerrit.wikimedia.org/r/512475 [12:01:39] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [12:02:13] (03CR) 10Filippo Giunchedi: [C: 03+1] Temporarily rate limit specific IP [puppet] - 10https://gerrit.wikimedia.org/r/512475 (owner: 10Alexandros Kosiaris) [12:02:20] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 366 bytes in 9.997 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:03:02] (03PS2) 10Alexandros Kosiaris: Temporarily rate limit specific IP [puppet] - 10https://gerrit.wikimedia.org/r/512475 [12:03:21] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [12:05:42] (03PS3) 10Alexandros Kosiaris: Temporarily rate limit specific IP [puppet] - 10https://gerrit.wikimedia.org/r/512475 [12:06:36] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.0 503 Service Unavailable - 212 bytes in 10.002 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:07:39] (03PS4) 10Alexandros Kosiaris: 
Temporarily rate limit specific IP [puppet] - 10https://gerrit.wikimedia.org/r/512475 [12:07:57] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 366 bytes in 4.427 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:08:33] (03CR) 10Filippo Giunchedi: [C: 03+1] Temporarily rate limit specific IP [puppet] - 10https://gerrit.wikimedia.org/r/512475 (owner: 10Alexandros Kosiaris) [12:09:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] Temporarily rate limit specific IP [puppet] - 10https://gerrit.wikimedia.org/r/512475 (owner: 10Alexandros Kosiaris) [12:13:01] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:14:43] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:17:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:17:54] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.0 503 Service Unavailable - 212 bytes in 10.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:18:23] hey [12:21:33] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:21:55] !log bounce thumbor on thumbor1002 [12:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:17] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the 
critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [12:24:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [12:25:04] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 366 bytes in 9.809 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:26:11] hello arturo, if you want to follow along we are in the usual place [12:27:15] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:28:33] !log bounce thumbor on thumbor1002 [12:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:21] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:32:49] (03PS1) 10Giuseppe Lavagetto: cache::upload: temporarily prevent abuses [puppet] - 10https://gerrit.wikimedia.org/r/512477 [12:32:52] (03PS1) 10Urbanecm: Enable transwiki import between sqwiki and sqwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512478 (https://phabricator.wikimedia.org/T221234) [12:34:29] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [12:34:53] (03PS1) 10Urbanecm: Remove expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512479 (https://phabricator.wikimedia.org/T224337) [12:35:33] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [12:36:26] (03Abandoned) 10Urbanecm: [DNM] Test if tests are working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463996 (https://phabricator.wikimedia.org/T205995) (owner: 10Urbanecm) [12:36:49] (03PS1) 10Alexandros Kosiaris: Revert "Temporarily rate limit specific IP" [puppet] - 10https://gerrit.wikimedia.org/r/512480 [12:37:17] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Revert "Temporarily rate limit specific IP" [puppet] - 10https://gerrit.wikimedia.org/r/512480 (owner: 10Alexandros Kosiaris) [12:37:31] (03PS1) 10Urbanecm: [DNM] Test if tests are working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512481 (https://phabricator.wikimedia.org/T205995) [12:38:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] cache::upload: temporarily prevent abuses [puppet] - 10https://gerrit.wikimedia.org/r/512477 (owner: 
10Giuseppe Lavagetto) [12:38:17] (03PS2) 10Alexandros Kosiaris: cache::upload: temporarily prevent abuses [puppet] - 10https://gerrit.wikimedia.org/r/512477 (owner: 10Giuseppe Lavagetto) [12:38:19] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Test if tests are working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512481 (https://phabricator.wikimedia.org/T205995) (owner: 10Urbanecm) [12:38:27] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] cache::upload: temporarily prevent abuses [puppet] - 10https://gerrit.wikimedia.org/r/512477 (owner: 10Giuseppe Lavagetto) [12:41:09] (03Abandoned) 10Urbanecm: [DNM] Test if tests are working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512481 (https://phabricator.wikimedia.org/T205995) (owner: 10Urbanecm) [12:43:01] (03PS1) 10Urbanecm: [DNM] Test if tests are working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512482 (https://phabricator.wikimedia.org/T205995) [12:43:48] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Test if tests are working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512482 (https://phabricator.wikimedia.org/T205995) (owner: 10Urbanecm) [12:46:36] (03PS1) 10Alexandros Kosiaris: Amend HTTP code and message for abusive user [puppet] - 10https://gerrit.wikimedia.org/r/512483 [12:46:52] (03PS1) 10BBlack: Provide contact info w/ 403 to 224px abuse case [puppet] - 10https://gerrit.wikimedia.org/r/512484 [12:47:04] (03Abandoned) 10Urbanecm: [DNM] Test if tests are working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512482 (https://phabricator.wikimedia.org/T205995) (owner: 10Urbanecm) [12:47:52] (03Abandoned) 10Alexandros Kosiaris: Amend HTTP code and message for abusive user [puppet] - 10https://gerrit.wikimedia.org/r/512483 (owner: 10Alexandros Kosiaris) [12:48:54] (03PS2) 10BBlack: Provide contact info w/ 403 to 224px abuse case [puppet] - 10https://gerrit.wikimedia.org/r/512484 [12:50:20] (03PS1) 10Filippo Giunchedi: secret: add thumbor ban lists [labs/private] - 
10https://gerrit.wikimedia.org/r/512485 [12:50:51] (03PS1) 10Filippo Giunchedi: thumbor: add ban lists on client ip and path re [puppet] - 10https://gerrit.wikimedia.org/r/512486 [12:51:03] (03CR) 10BBlack: [C: 03+2] Provide contact info w/ 403 to 224px abuse case [puppet] - 10https://gerrit.wikimedia.org/r/512484 (owner: 10BBlack) [12:51:22] (03CR) 10jerkins-bot: [V: 04-1] thumbor: add ban lists on client ip and path re [puppet] - 10https://gerrit.wikimedia.org/r/512486 (owner: 10Filippo Giunchedi) [12:53:29] (03CR) 10Filippo Giunchedi: "See also Icef85709" [labs/private] - 10https://gerrit.wikimedia.org/r/512485 (owner: 10Filippo Giunchedi) [12:53:47] (03CR) 10Filippo Giunchedi: "See also I907a980c" [puppet] - 10https://gerrit.wikimedia.org/r/512486 (owner: 10Filippo Giunchedi) [12:55:04] (03PS2) 10Filippo Giunchedi: thumbor: add ban lists on client ip and path re [puppet] - 10https://gerrit.wikimedia.org/r/512486 [13:32:26] (03PS1) 10Urbanecm: Fix Serbian projects' wgRestrictionLevels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512487 (https://phabricator.wikimedia.org/T217005) [13:46:58] !log decommissioning restbase1011-b -- T223976 [13:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:03] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 [13:50:42] (03PS1) 10Urbanecm: Remove bureaucrat protection level for all Serbian projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512488 (https://phabricator.wikimedia.org/T217005) [14:00:48] (03CR) 10Zoranzoki21: [C: 03+1] "Should be ok, Martin knows better :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512487 (https://phabricator.wikimedia.org/T217005) (owner: 10Urbanecm) [14:02:14] (03CR) 10Zoranzoki21: [C: 03+1] Remove bureaucrat protection level for all Serbian projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512488 (https://phabricator.wikimedia.org/T217005) (owner: 10Urbanecm) [17:00:15] 
PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [17:15:33] (03PS1) 10Andrew Bogott: cloudcontrol1004: move to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/512492 (https://phabricator.wikimedia.org/T221770) [17:16:15] (03CR) 10jerkins-bot: [V: 04-1] cloudcontrol1004: move to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/512492 (https://phabricator.wikimedia.org/T221770) (owner: 10Andrew Bogott) [17:17:46] (03PS2) 10Andrew Bogott: cloudcontrol1004: move to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/512492 (https://phabricator.wikimedia.org/T221770) [17:18:33] (03PS3) 10Andrew Bogott: cloudcontrol1004: move to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/512492 (https://phabricator.wikimedia.org/T221770) [17:19:12] (03CR) 10Andrew Bogott: [C: 03+2] cloudcontrol1004: move to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/512492 (https://phabricator.wikimedia.org/T221770) (owner: 10Andrew Bogott) [17:27:15] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:18:33] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:30:43] 10Operations, 10Wikimedia-Site-requests: Add another bad word to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Reedy) [18:30:57] that paged [18:33:22] ACKNOWLEDGEMENT - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
andrew bogott this is me rebuilding cloudcontrol1004 [18:40:53] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Reedy) [18:42:49] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Reedy) [18:49:06] (03PS1) 10Andrew Bogott: cloudcontrol: temporarily mark out prometheus classes on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/512493 (https://phabricator.wikimedia.org/T224345) [18:49:39] (03CR) 10jerkins-bot: [V: 04-1] cloudcontrol: temporarily mark out prometheus classes on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/512493 (https://phabricator.wikimedia.org/T224345) (owner: 10Andrew Bogott) [18:50:05] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:50:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:50:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:50:35] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:50:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:53:05] (03PS2) 10Andrew Bogott: cloudcontrol: temporarily mark out prometheus classes on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/512493 (https://phabricator.wikimedia.org/T224345) [18:53:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:54:25] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:54:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:54:51] (03CR) 10Andrew Bogott: [C: 03+2] cloudcontrol: temporarily mark out prometheus classes on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/512493 (https://phabricator.wikimedia.org/T224345) (owner: 10Andrew Bogott) [18:57:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:57:15] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [19:11:24] (03PS1) 10Andrew Bogott: cloudservices1004: move to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/512494 
(https://phabricator.wikimedia.org/T221769) [19:12:17] !log reimaging cloudservices1004 with Stretch [19:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:22] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices1004: move to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/512494 (https://phabricator.wikimedia.org/T221769) (owner: 10Andrew Bogott) [19:16:15] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [19:42:51] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:44:44] ACKNOWLEDGEMENT - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott I'm looking at this [20:11:04] (03PS1) 10Jbond: varnish: ratlimit unusual image sizes [puppet] - 10https://gerrit.wikimedia.org/r/512495 [20:15:03] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [20:16:15] (03PS2) 10Jbond: varnish: ratlimit unusual image sizes [puppet] - 10https://gerrit.wikimedia.org/r/512495 [20:20:19] (03PS3) 10Jbond: varnish: ratlimit unusual image sizes [puppet] - 10https://gerrit.wikimedia.org/r/512495 [20:26:33] (03PS4) 10Jbond: varnish: ratlimit unusual image sizes [puppet] - 10https://gerrit.wikimedia.org/r/512495 [20:29:23] (03PS2) 10Urbanecm: Change arwiki's default user preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501926 (https://phabricator.wikimedia.org/T220186) [20:32:09] I like ratlimit but I imagine that's a typo... [20:34:07] some of the common sizes are listed underneath most images on commons, i.e.
https://commons.wikimedia.org/wiki/File:D%C3%BClmen,_Merfeld,_D%C3%BClmener_Wildpferde_in_der_Wildbahn_--_2016_--_4201.jpg [20:34:11] as an example [20:34:51] https://commons.wikimedia.org/wiki/File:APC_Catch_SuperClash_II_6.jpg second example [20:34:59] and then there are icons and such as well... [20:35:12] I'm out, off to bed [20:41:37] thanks apergos [20:41:42] and good night :) [20:47:00] (03PS5) 10Jbond: varnish: ratelimit unusual image sizes [puppet] - 10https://gerrit.wikimedia.org/r/512495 [20:51:53] (03PS6) 10Jbond: varnish: ratelimit unusual image sizes [puppet] - 10https://gerrit.wikimedia.org/r/512495 [20:52:13] (03PS1) 10Andrew Bogott: Revert "cloudservices1004: move to Stretch" [puppet] - 10https://gerrit.wikimedia.org/r/512497 (https://phabricator.wikimedia.org/T221769) [20:52:58] (03CR) 10Andrew Bogott: [C: 03+2] Revert "cloudservices1004: move to Stretch" [puppet] - 10https://gerrit.wikimedia.org/r/512497 (https://phabricator.wikimedia.org/T221769) (owner: 10Andrew Bogott) [20:58:15] (03PS7) 10Jbond: varnish: ratelimit unusual image sizes [puppet] - 10https://gerrit.wikimedia.org/r/512495 [21:01:53] (03PS8) 10Jbond: varnish: ratelimit unusual image sizes [puppet] - 10https://gerrit.wikimedia.org/r/512495 [21:08:19] PROBLEM - Host 208.80.154.24 is DOWN: PING CRITICAL - Packet loss = 100% [21:13:41] andrewbogott ^^ [21:13:44] that's cloud-recursor1.wikimedia.org. 
[21:13:53] hm, downtime must've expired [21:16:17] (03PS1) 10Andrew Bogott: apparently we can't actually build jessie hosts anymore :( [puppet] - 10https://gerrit.wikimedia.org/r/512501 [21:17:14] (03CR) 10Andrew Bogott: [C: 03+2] apparently we can't actually build jessie hosts anymore :( [puppet] - 10https://gerrit.wikimedia.org/r/512501 (owner: 10Andrew Bogott) [21:19:13] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512220 (https://phabricator.wikimedia.org/T223024) (owner: 10Acamicamacaraca) [21:23:30] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Sphilbrick) {F29247966} I would have guessed that "nazi" on the list would pick up "nazis" but the attached screenshot suggests otherwise (if that works, i'never uploaded an image... [21:28:42] (03CR) 10Krinkle: [C: 03+2] deployment-prep: Use new cxserver running in Docker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510586 (https://phabricator.wikimedia.org/T220235) (owner: 10Alex Monk) [21:29:54] (03Merged) 10jenkins-bot: deployment-prep: Use new cxserver running in Docker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510586 (https://phabricator.wikimedia.org/T220235) (owner: 10Alex Monk) [21:30:13] (03CR) 10jenkins-bot: deployment-prep: Use new cxserver running in Docker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510586 (https://phabricator.wikimedia.org/T220235) (owner: 10Alex Monk) [21:30:52] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Reedy) Cheers for the screenshot. Nazi isn't on the list, hence the request to have it added [21:41:05] RECOVERY - Host 208.80.154.24 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [21:46:15] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[21:49:10] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Platonides) nazi is on the blacklist present in the repository since 2014, see 1e5bd7dc3c1be Seems we are using a different blacklist for captcha generation, but I see no need to... [21:50:02] * Krinkle staging on mwdebug1002 [21:50:36] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [21:51:27] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Reedy) >>! In T224343#5212671, @Platonides wrote: > nazi is on the blacklist present in the repository since 2014, see 1e5bd7dc3c1be Might be in that one, it's not in the list WM... [21:53:44] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:54:30] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Reedy) >>! In T224343#5212671, @Platonides wrote: > Seems we are using a different blacklist for captcha generation, but I see no need to do so. Probably not. But the one in WMF... [21:56:25] (03CR) 10Krinkle: "Is there a related task?" [puppet] - 10https://gerrit.wikimedia.org/r/512495 (owner: 10Jbond) [21:58:36] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Platonides) The repo has a tiny blocklist (510 bytes). A 5KB blocklist seems perfectly acceptable to store there, and a good blocklist would benefit everyone. Do we know the sourc... 
[22:00:05] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.6/includes/Linker.php: T222628 / c735a545df3a (duration: 00m 51s) [22:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:13] T222628: Some history views and diffs unavailable on Wikipedias (Fatal ParameterAssertionException: Bad value for parameter $dbkey) - https://phabricator.wikimedia.org/T222628 [22:00:38] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [22:01:28] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [22:02:32] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 76031 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:04:24] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:06:21] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Reedy) >>! In T224343#5212690, @Platonides wrote: > The repo has a tiny blocklist (510 bytes). A 5KB blocklist seems perfectly acceptable to store there, and a good blocklist woul... [22:15:27] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Nuria) Approved, yes. Many thanks. [22:16:40] 10Operations, 10Wikimedia-Site-requests: Add more bad words to fancycaptcha/badwords - https://phabricator.wikimedia.org/T224343 (10Platonides) The actual wordlist was published by Tim many years ago,¹ which is much more relevant for an attacker than the blacklist. Someone even created a greasemonkey to check... 
[22:20:38] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [22:23:48] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:29:52] PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [22:30:44] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [22:34:28] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:41:10] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [22:41:41] !log decommissioning restbase1011-c -- T223976 [22:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:47] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 [22:56:40] RECOVERY - puppet last run on authdns2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:14:40] PROBLEM - Recursive DNS on 208.80.154.24 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/DNS [23:35:38] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:41:14] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [23:45:32] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[23:51:06] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [23:56:42] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:57:02] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 101.8 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen