[01:46:07] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [01:47:55] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2020-03-05 05:48:16 +0000 (expires in 38 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [03:27:43] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1010 is CRITICAL: 4.335e+04 ge 3600 Gehel test server, dont care at this time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [03:29:11] !log restarting blazegraph on wdqs100[57] [03:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:11:03] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:12:31] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:55:49] Operations, Design-Research, Domains, Traffic: Register wikipersonas.org and redirect URL - https://phabricator.wikimedia.org/T241944 (Dendelele) >>! In T241944#5779323, @Dzahn wrote: > @Dendelele Please ask OIT to get you office wiki access and a contractor email address. They will help you with... 
[05:11:53] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Telia trouble ticket 01103058 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:11:53] ACKNOWLEDGEMENT - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Telia trouble ticket 01103058 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:16:19] (PS1) Legoktm: Add profile and module for for static HTML dump of CodeReview [puppet] - https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) [06:18:29] (CR) jerkins-bot: [V: -1] Add profile and module for for static HTML dump of CodeReview [puppet] - https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) (owner: Legoktm) [07:37:10] (CR) ArielGlenn: [C: +1] "Seems fine to me, do you need ayone else to sign off? Should I just merge it through?" [puppet] - https://gerrit.wikimedia.org/r/566822 (https://phabricator.wikimedia.org/T243426) (owner: Joal) [08:50:18] (CR) Jdlrobson: "> Yes, I know, and in fact I was there for the first two. I missed the third one." [mediawiki-config] - https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: Ammarpad) [08:58:11] Operations, MediaWiki-Authentication-and-authorization, Security-Team, Traffic, Security: Investigate usefulness of SameSite cookies for logged-in accounts - https://phabricator.wikimedia.org/T158604 (Bawolff) >>! In T158604#5775825, @Tgr wrote: > Also, if I read the spec correctly (and this... 
[09:31:27] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [09:33:17] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2020-03-05 05:48:16 +0000 (expires in 37 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [10:58:37] Operations, Traffic, Performance Issue, Wikimedia-Incident: Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage - https://phabricator.wikimedia.org/T243713 (Michael) This seems to have been made worse by {T243725}. The patches for that Wikiba... [11:11:47] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [11:19:01] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2020-03-05 05:48:16 +0000 (expires in 37 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [11:24:08] Operations, Traffic, Performance Issue, Wikimedia-Incident: Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage - https://phabricator.wikimedia.org/T243713 (jcrespo) p:Unbreak!→High [11:50:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:51:03] * effie looking [11:51:13] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/566383 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn) [11:51:34] happy all-hands y'all [11:52:43] RECOVERY - 
MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:53:16] ok so it was parsoid, I am checking if there is an open task [11:57:21] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [11:59:49] Operations, Performance-Team: Daily errors on webperf1002 & webperf2002 /usr/local/bin/arclamp-generate-svgs > /dev/null - https://phabricator.wikimedia.org/T243762 (jcrespo) [12:00:59] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2020-03-05 05:48:16 +0000 (expires in 37 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [12:07:13] hello, what did I miss? [12:10:20] !log ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --from-id 1860 --to-id 1860 (T243705) [12:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:24] T243705: Lexemes: in English, "unknown language" appears instead of "English" in search, and "Q1860" appears instead of "English" on lexeme pages - https://phabricator.wikimedia.org/T243705 [12:44:48] apergos: not much [12:45:04] apergos: something is lightly misbehaving in parsoid [12:45:30] any idea what? 
I guess harder to know since it recovered [12:47:36] apergos: yes, I am opening a task now [12:47:49] ok [13:32:25] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [13:35:19] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2020-03-05 05:48:16 +0000 (expires in 37 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [13:36:40] !log restarting varnish-fe on cp4029 - T243634 [13:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:43] T243634: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 [13:38:22] (PS1) Marostegui: db1125: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/567448 [13:39:27] (CR) Marostegui: [C: +2] db1125: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/567448 (owner: Marostegui) [13:54:15] !log repooling cp4029 - T243634 [13:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:18] T243634: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 [13:54:39] !log restarting varnish-fe on cp4030 - T243634 [13:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:46] (PS1) Arturo Borrero Gonzalez: cloud: introduce delegation for codfw1dev.wikimedia.cloud [dns] - https://gerrit.wikimedia.org/r/567453 (https://phabricator.wikimedia.org/T243556) [13:56:03] (PS1) Arturo Borrero Gonzalez: wmcs: codfw1dev: fix .cloud domain [puppet] - https://gerrit.wikimedia.org/r/567454 (https://phabricator.wikimedia.org/T243556) [13:58:25] !log repooling cp4030 - T243634 [13:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:05] (CR) Zoranzoki21: "> Hey, this looks good to go. 
Were you going to schedule it for a" [mediawiki-config] - https://gerrit.wikimedia.org/r/565723 (https://phabricator.wikimedia.org/T243118) (owner: Zoranzoki21) [14:44:36] (PS4) Zoranzoki21: Add vzg-easydb.gbv.de to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/565723 (https://phabricator.wikimedia.org/T243118) [14:44:41] (PS5) Zoranzoki21: Add vzg-easydb.gbv.de to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/565723 (https://phabricator.wikimedia.org/T243118) [14:45:23] (PS6) Zoranzoki21: Add vzg-easydb.gbv.de to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/565723 (https://phabricator.wikimedia.org/T243118) [14:46:51] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:48:49] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [14:49:38] ^ checking [14:49:59] Hello, how to rebase patch in terminal? [14:50:37] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2020-03-05 05:48:16 +0000 (expires in 37 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [14:52:17] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:52:28] Because when I do it, bad things happen [14:52:48] It wants to upload patch set on merged patch [15:29:55] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:33] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:49] Operations, Discovery, Traffic, Wikidata, and 2 others: Wikidata queryservice lag repeatedly over 5s since Jan20, 2020 - https://phabricator.wikimedia.org/T243701 (Dvorapa) Today it is around 8-10s the whole day [15:46:19] (PS2) Arturo Borrero Gonzalez: cloud: introduce delegation for codfw1dev.wikimedia.cloud [dns] - https://gerrit.wikimedia.org/r/567453 (https://phabricator.wikimedia.org/T243556) [15:48:36] Operations, Discovery, Traffic, Wikidata, and 2 others: Wikidata queryservice lag repeatedly over 5s since Jan20, 2020 - https://phabricator.wikimedia.org/T243701 (Dvorapa) [15:49:43] Operations, Discovery, Traffic, Wikidata, and 2 others: Wikidata queryservice lag repeatedly over 5s since Jan20, 2020 - https://phabricator.wikimedia.org/T243701 (Dvorapa) [15:49:55] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: introduce delegation for codfw1dev.wikimedia.cloud [dns] - https://gerrit.wikimedia.org/r/567453 (https://phabricator.wikimedia.org/T243556) (owner: Arturo Borrero Gonzalez) [15:53:30] (PS13) Ammarpad: Add minerva custom log for la.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) [15:54:09] 
(CR) Thcipriani: [C: +1] contint: use package_from_component, stop using docker class [puppet] - https://gerrit.wikimedia.org/r/566383 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn) [15:59:53] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: codfw1dev: fix .cloud domain [puppet] - https://gerrit.wikimedia.org/r/567454 (https://phabricator.wikimedia.org/T243556) (owner: Arturo Borrero Gonzalez) [16:01:32] (CR) Ammarpad: "Thank you. Please do schedule it" [mediawiki-config] - https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: Ammarpad) [16:03:06] Operations, MediaWiki-API, Traffic, Wikidata, and 2 others: wikidata.org handles GET MWAPI requests, but silently fails on POST - https://phabricator.wikimedia.org/T230051 (Addshore) Not sure if someone from #operations or #traffic might be able to enlighten us on how easy this would be to fix? [16:04:58] Operations, ops-eqiad, SRE-swift-storage: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T242511 (Cmjohnson) @godog I replaced the disk, please see what you need to do to add it back to the raid. Thanks! [16:08:27] RECOVERY - HP RAID on ms-be1039 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:19:15] Operations, Discovery, Traffic, Wikidata, and 2 others: Wikidata queryservice lag repeatedly over 5s since Jan20, 2020 - https://phabricator.wikimedia.org/T243701 (jcrespo) I am not part of the Wikidata QS team, so I don't have answers, just questions :-D Only chiming in because my team was been... 
[16:30:28] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata queryservice lag repeatedly over 5s since Jan20, 2020 - https://phabricator.wikimedia.org/T243701 (Addshore) [16:31:13] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Addshore) [16:32:24] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Dvorapa) >>! In T243701#5834698, @jcrespo wrote: > You mention: >> Pywikibot test environments and CIs... [16:38:38] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Addshore) >>! In T243701#5834698, @jcrespo wrote: >> Seconds or even a minute or two lag seems acceptab... [16:40:40] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Dvorapa) (please keep in mind I don't understand the differences, all I understand is that API lag is o... [16:42:04] (PS1) Cmjohnson: updating dhcp file for mc-gp1001-1003 [puppet] - https://gerrit.wikimedia.org/r/567476 (https://phabricator.wikimedia.org/T241795) [16:43:39] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (jcrespo) > API maxlag has been repeatedly declared to always be <5s by several people in the past That... 
[16:44:44] (CR) Cmjohnson: [C: +2] updating dhcp file for mc-gp1001-1003 [puppet] - https://gerrit.wikimedia.org/r/567476 (https://phabricator.wikimedia.org/T241795) (owner: Cmjohnson) [16:45:21] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Dvorapa) > What are these tests doing? They are obviously testing Pywikibot functions against several w... [16:46:03] Operations, ops-eqiad, serviceops, Patch-For-Review: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (Cmjohnson) The mac addresses were out of order...updated the file and attempting to run the installer [16:46:18] Operations, ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (Cmjohnson) [16:46:35] Operations, ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (Cmjohnson) the OS has been installed on 4 of these, initial puppet run has not been done [16:52:27] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Dvorapa) > For the specific issue you are facing, I may be suggest to review SLA expectations about the... [16:59:43] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Lucas_Werkmeister_WMDE) I wonder if it would make sense to ignore query service lag on GET requests? Th... [17:00:03] RECOVERY - Device not healthy -SMART- on ms-be1039 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1039&var-datasource=eqiad+prometheus/ops [17:14:16] (PS1) Arturo Borrero Gonzalez: openstack: puppet: encapi: not declare python-yaml [puppet] - https://gerrit.wikimedia.org/r/567485 [17:15:39] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: puppet: encapi: not declare python-yaml [puppet] - https://gerrit.wikimedia.org/r/567485 (owner: Arturo Borrero Gonzalez) [18:12:05] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:36:45] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/ [18:37:31] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:38:33] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2020-03-05 05:48:16 +0000 (expires in 37 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/ [19:15:14] !log Remove partitions from db2085 enwiki - T239453 [19:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:18] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [19:16:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2085:3311 - T239453', diff saved to https://phabricator.wikimedia.org/P10277 and previous config saved to /var/cache/conftool/dbconfig/20200127-191614-marostegui.json [19:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:27] (PS2) ArielGlenn: write out and reuse pagerage info for big page content jobs [dumps] - 
https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [19:21:54] (CR) jerkins-bot: [V: -1] write out and reuse pagerage info for big page content jobs [dumps] - https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) (owner: ArielGlenn) [19:22:36] ffs flake8. fine i'll add the exclusion to the top level tox.ini >_< [19:23:48] (PS3) ArielGlenn: write out and reuse pagerage info for big page content jobs [dumps] - https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [19:24:08] (CR) jerkins-bot: [V: -1] write out and reuse pagerage info for big page content jobs [dumps] - https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) (owner: ArielGlenn) [19:25:26] (PS4) ArielGlenn: write out and reuse pagerage info for big page content jobs [dumps] - https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [21:19:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:48:19] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [21:53:31] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:54:01] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:54:41] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:55:21] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:55:21] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:55:27] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:56:27] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:57:12] waow [21:57:14] what fixed it? 
[21:58:55] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:59:29] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:00:27] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:00:43] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:01:13] that's a huge increase in response times for the api [22:01:28] !log restarting varnish-fe on cp4028 [22:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:01] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:03:49] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:04:21] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:05:01] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:05:35] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:07:23] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:07:58] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:09:53] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:12:55] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:13:33] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:14:09] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad 
on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [22:14:11] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:14:19] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:14:19] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:14:31] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:14:45] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:15:21] PROBLEM - recommendation_api endpoints health on scb1001 is 
CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:15:27] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:15:27] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:15:35] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:15:35] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:15:43] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:15:53] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:15:53] PROBLEM - restbase endpoints health on restbase2022 
is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:15:57] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:16:01] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:16:05] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:16:07] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:17:11] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:17:13] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:17:21] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:17:27] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:17:27] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: 
/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:17:29] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:17:45] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:17:53] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:19:17] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:19:39] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:20:03] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [22:20:15] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:21:11] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 
29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:21:21] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:21:35] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:21:41] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:21:41] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:21:53] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [22:22:01] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:22:45] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:22:53] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:22:53] (03PS1) 10Vgutierrez: varnish: Use ats-tls instead of nginx on varnish-frontend-restart [puppet] - 10https://gerrit.wikimedia.org/r/567514 (https://phabricator.wikimedia.org/T236120) [22:23:11] 
RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:23:11] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:23:23] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:23:27] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:23:29] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:23:43] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:24:45] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:25:20] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:26:43] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:27:57] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:29:27] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:30:11] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:30:25] 
(03PS1) 10Ema: vcl: stricter rate limiting for node-fetch [puppet] - 10https://gerrit.wikimedia.org/r/567515 [22:30:41] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [22:31:55] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:31:59] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:31:59] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:32:01] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:33:05] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:33:15] (03CR) 10Giuseppe Lavagetto: [C: 03+1] vcl: stricter rate limiting for node-fetch [puppet] - 10https://gerrit.wikimedia.org/r/567515 (owner: 10Ema) [22:33:43] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:34:19] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:34:36] (03CR) 10Vgutierrez: [C: 03+2] vcl: stricter rate limiting for 
node-fetch [puppet] - 10https://gerrit.wikimedia.org/r/567515 (owner: 10Ema) [22:35:45] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [22:39:39] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [22:41:07] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [22:41:20] https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?orgId=1 [22:41:25] shows no thread issues. [22:42:29] Any SRE around for ^? :) [22:42:45] We're looking into it [22:42:57] thanks [22:43:35] PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator [22:44:32] that's not good :( [22:44:41] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 864 bytes in 2.062 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [22:45:05] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27384 bytes in 5.118 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [22:45:17] RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 36940 bytes in 0.129 second response time https://wikitech.wikimedia.org/wiki/Phabricator [22:46:37] pages... 
[22:47:22] got the phab critical and recovery together [22:52:31] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [22:53:59] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [22:55:37] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.178 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [22:56:08] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27384 bytes in 6.753 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [22:58:05] !log restarting gerrit service [22:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:21] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [23:01:41] <_joe_> !log restart apache on gerrit [23:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:12] (03CR) 10Ema: [C: 03+1] varnish: Use ats-tls instead of nginx on varnish-frontend-restart [puppet] - 10https://gerrit.wikimedia.org/r/567514 (https://phabricator.wikimedia.org/T236120) (owner: 10Vgutierrez) [23:02:52] (03CR) 10Vgutierrez: [C: 03+2] varnish: Use ats-tls instead of nginx on varnish-frontend-restart [puppet] - 10https://gerrit.wikimedia.org/r/567514 (https://phabricator.wikimedia.org/T236120) (owner: 10Vgutierrez) [23:06:47] !log filippo@cumin1001 START 
- Cookbook sre.hosts.downtime [23:06:47] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [23:06:48] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:58] 10Operations, 10Gerrit: gerritro user getting access denied from dbproxy1007 - https://phabricator.wikimedia.org/T243800 (10Marostegui) [23:10:17] !log rolling restart of varnish-frontend in cp4026 and cp4027 [23:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:37] RECOVERY - Disk space on ms-be1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1039&var-datasource=eqiad+prometheus/ops [23:10:43] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:10:54] 10Operations, 10Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Marostegui) [23:11:28] 10Operations, 10Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Marostegui) [23:12:09] 10Operations, 10Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Paladox) I think this needs updating https://github.com/wikimedia/puppet/blob/production/modules/gerrit/manifests/jetty.pp#L44 (unless the pass is the same for the *ro user). [23:13:25] 10Operations, 10ops-eqiad, 10SRE-swift-storage: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T242511 (10fgiunchedi) 05Open→03Resolved >>! In T242511#5834630, @Cmjohnson wrote: > @godog I replaced the disk, please see what you need to do to add it back to the raid. Thanks! Thanks Chris... [23:15:45] (03PS1) 10Ema: traffic-pool.service: replace nginx with ats-tls [puppet] - 10https://gerrit.wikimedia.org/r/567526 (https://phabricator.wikimedia.org/T231627) [23:16:34] 10Operations, 10Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Marostegui) The password for the ro user is different from that one (on the pw repo, which is the one I used). I tried the one on https://github.com/wikimedia/puppet/blob/production/modul... 
[23:18:02] (03PS2) 10Ema: traffic-pool.service: replace nginx with ats-tls [puppet] - 10https://gerrit.wikimedia.org/r/567526 (https://phabricator.wikimedia.org/T231627) [23:18:52] 10Operations, 10LDAP-Access-Requests: Request for LDAP access to the WMF group for Sakti Pramudya - https://phabricator.wikimedia.org/T243802 (10SpramudyaDev) [23:18:54] marostegui couldn't we add the pass here https://github.com/wikimedia/labs-private/tree/master/hieradata/hosts for gerrit1002? [23:24:04] 10Operations, 10Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Paladox) Couldn't we add the pass here https://github.com/wikimedia/labs-private/tree/master/hieradata/hosts for gerrit1002? [23:32:50] 10Operations, 10LDAP-Access-Requests: Request for LDAP access to the WMF group for Sakti Pramudya - https://phabricator.wikimedia.org/T243802 (10SpramudyaDev) [23:41:30] 10Operations, 10Core Platform Team, 10MediaWiki-Parser, 10PoolCounter, 10serviceops: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (10Joe) [23:41:44] 10Operations, 10Core Platform Team, 10MediaWiki-Parser, 10PoolCounter, 10serviceops: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (10Joe) p:05Triage→03High [23:46:50] 10Operations, 10Core Platform Team, 10MediaWiki-Parser, 10serviceops, 10Wikimedia-Incident: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (10Krinkle) [23:55:57] 10Operations, 10Traffic, 10Performance Issue, 10Wikimedia-Incident: Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage - https://phabricator.wikimedia.org/T243713 (10Krinkle) >>! In T243713#5833717, @Michael wrote: > This seems to have been made worse...