[00:05:38] !log LDAP - adding verenali to wmde and nda groups, to match raja_wmde (T233807, T231677)
[00:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:44] T233807: NDA Request from WMDE employee Verena - https://phabricator.wikimedia.org/T233807
[00:05:44] T231677: Superset + Turnilo access for Verena Lindner + Raja Gumienny (WMDE) - https://phabricator.wikimedia.org/T231677
[00:08:49] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: NDA Request from WMDE employee Verena - https://phabricator.wikimedia.org/T233807 (10Dzahn) @RStallman-legalteam Thank you! Done. @Verena Done. You should now be able to login on both http://turnilo.wikimedia.org and http://s...
[00:09:49] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: NDA Request from WMDE employee Verena - https://phabricator.wikimedia.org/T233807 (10Dzahn) 05Open→03Resolved If there are any problems with the logins just ping me or reopen the ticket.
[00:10:28] 10Operations, 10LDAP-Access-Requests: NDA Request from WMDE employee Raja - https://phabricator.wikimedia.org/T231984 (10Dzahn)
[00:10:30] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: NDA Request from WMDE employee Verena - https://phabricator.wikimedia.org/T233807 (10Dzahn)
[00:17:04] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests, and 2 others: Analytics Access for Grant (groups cn=wmf and analytics-privatedata-users) - https://phabricator.wikimedia.org/T235260 (10Dzahn) >>! In T235260#5587341, @Nuria wrote: > Once that patch is merged, you can try > > https:/...
[00:18:18] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests, and 2 others: Analytics Access for Grant (groups cn=wmf and analytics-privatedata-users) - https://phabricator.wikimedia.org/T235260 (10Nuria) oohhh, nice thank you! @gsingers please try https://turnilo.wikimedia.org and let us know w...
[00:20:36] (03PS1) 10Papaul: DNS: Remove production and mgmt DNS for frav1001 [dns] - 10https://gerrit.wikimedia.org/r/544279
[01:39:35] 10Operations, 10cloud-services-team: Failing puppet runs on labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T235819 (10crusnov) p:05Triage→03Normal
[08:41:25] !log add user papaul to fasw-c-eqiad
[08:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:49] PROBLEM - Check systemd state on elastic2037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:27:01] RECOVERY - Check systemd state on elastic2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:25:03] (03PS2) 10Volans: devices: refactor signature [software/homer] - 10https://gerrit.wikimedia.org/r/543889
[16:25:05] (03PS2) 10Volans: netbox: allow to select the devices from Netbox [software/homer] - 10https://gerrit.wikimedia.org/r/543890 (https://phabricator.wikimedia.org/T228388)
[16:25:07] (03PS2) 10Volans: setup.py: remove unused test dependency [software/homer] - 10https://gerrit.wikimedia.org/r/543965
[16:25:45] (03CR) 10Volans: "reply inline, addressed comment" (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/543889 (owner: 10Volans)
[16:29:01] (03CR) 10Volans: [C: 03+2] devices: fix behaviour according to docstring [software/homer] - 10https://gerrit.wikimedia.org/r/543886 (owner: 10Volans)
[16:29:13] (03CR) 10Volans: [C: 03+2] typing: use Mapping instead of Dict for arguments [software/homer] - 10https://gerrit.wikimedia.org/r/543887 (owner: 10Volans)
[16:29:21] (03CR) 10Volans: [C: 03+2] tests: increase coverage for transports.junos [software/homer] - 10https://gerrit.wikimedia.org/r/543888 (owner: 10Volans)
[16:32:26] (03Merged) 10jenkins-bot: devices: fix behaviour according to docstring [software/homer] - 10https://gerrit.wikimedia.org/r/543886 (owner: 10Volans)
[16:32:28] (03Merged) 10jenkins-bot: typing: use Mapping instead of Dict for arguments [software/homer] - 10https://gerrit.wikimedia.org/r/543887 (owner: 10Volans)
[16:32:30] (03Merged) 10jenkins-bot: tests: increase coverage for transports.junos [software/homer] - 10https://gerrit.wikimedia.org/r/543888 (owner: 10Volans)
[16:51:34] 10Operations, 10SRE-swift-storage, 10User-fgiunchedi: Deleting file on Commons "Error deleting file: An unknown error occurred in storage backend "local-multiwrite"." - https://phabricator.wikimedia.org/T173374 (10czar) Better to open or refile? I'm getting the same error: >Error deleting file: An unknown er...
[21:20:23] 10Operations, 10SRE-swift-storage, 10User-fgiunchedi: Deleting file on Commons "Error deleting file: An unknown error occurred in storage backend "local-multiwrite"." - https://phabricator.wikimedia.org/T173374 (10Krinkle) Please create a new task :)
[21:29:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:29:56] I get error Error: 503, Backend fetch failed at Sat, 19 Oct 2019 21:29:39 GMT frequently
[21:29:59] already known?
[21:30:11] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:30:17] Sagan: I don't see that in backlog
[21:30:25] I seem to especially be getting it on load.php
[21:30:29] 10Operations: Error 503 - https://phabricator.wikimedia.org/T235949 (10RhinosF1)
[21:30:47] 10Operations: Error 503 - https://phabricator.wikimedia.org/T235949 (10RhinosF1)
[21:30:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:31:22] 10Operations: Error 503 - https://phabricator.wikimedia.org/T235949 (10RhinosF1)
[21:31:31] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:31:31] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=
[21:31:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:31:43] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:31:43] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:31:44] hm
[21:31:47] this sounds bad
[21:31:49] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:31:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:31:56] indeed
[21:32:06] 10Operations: Error 503 - https://phabricator.wikimedia.org/T235949 (10RhinosF1) p:05Triage→03High Back but slow, can someone have a look?
[21:32:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:32:49] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[21:33:19] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[21:33:26] 10Operations: Error 503 and slow loading on enwiki - https://phabricator.wikimedia.org/T235949 (10RhinosF1)
[21:33:56] Just got a 503 on Commons
[21:34:11] 10Operations: Error 503 and slow loading on enwiki - https://phabricator.wikimedia.org/T235949 (10Vachovec1) The same on cs-wiki.
[21:34:23] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[21:35:11] (03PS1) 10Jon Harald Søby: Add Mon (mnw) language [dns] - 10https://gerrit.wikimedia.org/r/544325 (https://phabricator.wikimedia.org/T235739)
[21:35:24] 10Operations: Error 503 and slow loading on enwiki - https://phabricator.wikimedia.org/T235949 (10Pine) Same when trying to log into EN. Also, I got this error when logging out of Meta, and when I loaded pages on a few wikis the page content was abnormal.
[21:35:51] 10Operations: Error 503 and slow loading on enwiki - https://phabricator.wikimedia.org/T235949 (10Krinkle) ` Status Code: 503 x-cache: cp1087 int, cp3033 miss, cp3030 pass Status Code: 200 x-cache: cp1087 pass, cp3032 pass, cp3030 pass ` Looks like it might be an app server or LVS issue given the same cp1087...
[21:36:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:36:21] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[21:36:33] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
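A note on the x-cache reading quoted at 21:35:51 above: comparing the status code and x-cache header of a failing and a succeeding request is enough to tell whether the 503 is synthesized at the cache layer ("int") or comes from behind it. A minimal Python sketch of that check follows; the target URL is purely illustrative and is not the page actually tested in the task.

    # Minimal sketch: fetch a page and print its status code plus x-cache header,
    # mirroring the comparison quoted from T235949 at 21:35:51.
    import urllib.error
    import urllib.request

    URL = "https://en.wikipedia.org/wiki/Foo"   # illustrative target, not from the log

    req = urllib.request.Request(URL, headers={"User-Agent": "x-cache-check/0.1"})
    try:
        resp = urllib.request.urlopen(req)
        status, headers = resp.status, resp.headers
    except urllib.error.HTTPError as err:
        # A 503 raises HTTPError, but the error object still carries the response headers.
        status, headers = err.code, err.headers

    # The same cp host showing "int" on failing requests but "pass" on successful ones
    # points behind the caches (app server / LVS) rather than at the cache layer itself.
    print(status, headers.get("x-cache"))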
[21:36:38] 10Operations: Error 503 and slow loading on multiple wikis - https://phabricator.wikimedia.org/T235949 (10RhinosF1)
[21:36:39] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:36:39] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:36:55] 10Operations: Error 503 and slow loading on multiple wikis - https://phabricator.wikimedia.org/T235949 (10Pine) p:05High→03Unbreak! Boldly changing priority to UBN. Feel free to revert if that seems problematic.
[21:36:57] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:37:16] It seems like there was an increase in memcached errors
[21:37:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:37:44] These few-minute spikes of 503s seem to be happening frequently ish
[21:37:57] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:38:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:38:09] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:38:09] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:38:21] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:38:26] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1 - mc1029 in particular bawolff ?
[21:39:25] If this is happening at regular intervals, perhaps a super hot key that expires?
[21:41:34] I don't even have access to logstash anymore, so i really don't know
[21:45:51] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=All&from=1571520300000&to=1571521200000 that bottom right graph, mc1029 hit 1 Gbps out
[21:47:02] timing does make it seem very suspect yeah
[21:47:40] Also looks like there was a bit of a general spike in traffic though at the same time
[21:48:20] maybe but it all seemingly being directed towards one memcached?
[21:56:21] 10Operations: Error 503 and slow loading on multiple wikis - https://phabricator.wikimedia.org/T235949 (10Pine) Logins and page loads now appear normal for me. @RhinosF1 and @Vachovec1 how about for you?
[22:02:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 53.44 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:02:29] 10Operations: Error 503 and slow loading on multiple wikis - https://phabricator.wikimedia.org/T235949 (10RhinosF1) Seems like it's resolved based on http://wm-bot.wmflabs.org/browser/index.php?start=10%2F19%2F2019&end=10%2F19%2F2019&display=%23wikimedia-operations but I'll leave this open as @Krenair and @Bawol...
[22:02:40] 10Operations: Error 503 and slow loading on multiple wikis - https://phabricator.wikimedia.org/T235949 (10RhinosF1) p:05Unbreak!→03Normal
[22:03:06] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/544325 (https://phabricator.wikimedia.org/T235739) (owner: 10Jon Harald Søby)
[22:03:13] 10Operations: Error 503 and slow loading on multiple wikis - https://phabricator.wikimedia.org/T235949 (10Pine) @RhinosF1 sounds good.
[22:04:52] https://grafana.wikimedia.org/d/000000180/varnish-http-requests?panelId=6&fullscreen&orgId=1 is interesting... traffic increase, then traffic _drop_ (out of the ordinary)?
[22:06:55] if its comparing changes that would be normal for a sudden increase followed by going back to normal, I think
[22:07:16] since the now normal, is comparing to past where it was abnormally high
[22:07:53] the title of the graph makes me think this is expected
[22:08:00] "now-30m text GET % diff"
[22:08:37] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 78.46 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:08:43] yeah if the current traffic is normal and it was unusually high half an hour ago, this graph will go down
[22:08:56] makes sense
[22:09:47] also compare the shapes :P
[22:17:04] 10Operations: Error 503 and slow loading on multiple wikis - https://phabricator.wikimedia.org/T235949 (10Bawolff) >>! In T235949#5588941, @RhinosF1 wrote: > Seems like it's resolved based on http://wm-bot.wmflabs.org/browser/index.php?start=10%2F19%2F2019&end=10%2F19%2F2019&display=%23wikimedia-operations but I...
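To make the 22:06-22:09 exchange above concrete: the "Varnish traffic drop between 30min ago and now" check compares the current text GET rate against the rate 30 minutes earlier, so a spike that has since subsided registers as an apparent drop even when current traffic is normal. The request rates below are invented purely to illustrate the arithmetic; only the thresholds (CRITICAL at 60, WARNING at 70) and the logged values 53.44 and 78.46 come from the alert messages above.

    # Hypothetical request rates; only the thresholds and alerted percentages are from the log.
    rate_30m_ago = 131_000   # req/s during the brief abnormal spike (invented figure)
    rate_now     = 70_000    # req/s once traffic is back to a normal level (invented figure)

    pct_of_previous = 100 * rate_now / rate_30m_ago
    print(f"current traffic is {pct_of_previous:.2f}% of 30 minutes ago")
    # ~53.44%, below the CRITICAL threshold of 60 (the "53.44 le 60" alert at 22:02:09),
    # even though the current rate is normal. Once the spike ages out of the 30-minute
    # comparison window the ratio climbs back above 70, matching the 78.46 recovery at 22:08:37.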
[22:21:04] 10Operations: Error 503 and slow loading on multiple wikis - https://phabricator.wikimedia.org/T235949 (10RhinosF1) Sounds good @Bawolff, have a good weekend all and thanks for the quick response.
[22:30:16] 10Operations: Error 503 and slow loading on multiple wikis - https://phabricator.wikimedia.org/T235949 (10Masumrezarock100) The same was happening on Commons. Looks like it is resolved now.
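A closing note on the mc1029 observation at 21:45:51: the "1 Gbps out" reading comes from the per-host memcached network graphs, and the same comparison can be pulled from Prometheus to spot a single hot instance, the pattern suspected at 21:39:25. The sketch below is only an illustration: the Prometheus endpoint, the instance label format and the regex are assumptions rather than the production setup, and the metric used is generic node-exporter egress, not a memcached-specific counter.

    # Sketch: rank memcached hosts by network egress to find one instance far hotter
    # than its peers (the "super hot key" hypothesis from the discussion above).
    import json
    import urllib.parse
    import urllib.request

    PROMETHEUS = "http://prometheus.example.org/ops"   # hypothetical endpoint, not the real one
    QUERY = 'rate(node_network_transmit_bytes_total{instance=~"mc10.*"}[5m]) * 8'  # bits/s; label scheme assumed

    url = PROMETHEUS + "/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)["data"]["result"]

    # One host pushing ~1 Gbit/s while the rest sit far lower matches the mc1029 pattern.
    for s in sorted(series, key=lambda s: float(s["value"][1]), reverse=True):
        print(s["metric"].get("instance"), f'{float(s["value"][1]) / 1e9:.2f} Gbit/s')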