[01:46:07] <icinga-wm>	 PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[01:47:55] <icinga-wm>	 RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2020-03-05 05:48:16 +0000 (expires in 38 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[03:27:43] <icinga-wm>	 ACKNOWLEDGEMENT - WDQS high update lag on wdqs1010 is CRITICAL: 4.335e+04 ge 3600 Gehel test server, don't care at this time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[03:29:11] <gehel>	 !log restarting blazegraph on wdqs100[57]
[03:29:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:11:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:12:31] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:55:49] <wikibugs>	 Operations, Design-Research, Domains, Traffic: Register wikipersonas.org and redirect URL - https://phabricator.wikimedia.org/T241944 (Dendelele) >>! In T241944#5779323, @Dzahn wrote: > @Dendelele Please ask OIT to get you office wiki access and a contractor email address. They will help you with...
[05:11:53] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Telia trouble ticket 01103058 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:11:53] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Telia trouble ticket 01103058 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:16:19] <wikibugs>	 (PS1) Legoktm: Add profile and module for for static HTML dump of CodeReview [puppet] - https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056)
[06:18:29] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add profile and module for for static HTML dump of CodeReview [puppet] - https://gerrit.wikimedia.org/r/567407 (https://phabricator.wikimedia.org/T243056) (owner: Legoktm)
[07:37:10] <wikibugs>	 (CR) ArielGlenn: [C: +1] "Seems fine to me, do you need anyone else to sign off? Should I just merge it through?" [puppet] - https://gerrit.wikimedia.org/r/566822 (https://phabricator.wikimedia.org/T243426) (owner: Joal)
[08:50:18] <wikibugs>	 (CR) Jdlrobson: "> Yes, I know, and in fact I was there for the first two. I missed the third one." [mediawiki-config] - https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: Ammarpad)
[08:58:11] <wikibugs>	 Operations, MediaWiki-Authentication-and-authorization, Security-Team, Traffic, Security: Investigate usefulness of SameSite cookies for logged-in accounts - https://phabricator.wikimedia.org/T158604 (Bawolff) >>! In T158604#5775825, @Tgr wrote: > Also, if I read the spec correctly (and this...
[09:31:27] <icinga-wm>	 PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[09:33:17] <icinga-wm>	 RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2020-03-05 05:48:16 +0000 (expires in 37 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[10:58:37] <wikibugs>	 Operations, Traffic, Performance Issue, Wikimedia-Incident: Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage - https://phabricator.wikimedia.org/T243713 (Michael) This seems to have been made worse by {T243725}. The patches for that Wikiba...
[11:11:47] <icinga-wm>	 PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[11:19:01] <icinga-wm>	 RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2020-03-05 05:48:16 +0000 (expires in 37 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[11:24:08] <wikibugs>	 Operations, Traffic, Performance Issue, Wikimedia-Incident: Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage - https://phabricator.wikimedia.org/T243713 (jcrespo) p:Unbreak!→High
[11:50:55] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:51:03] * effie looking
[11:51:13] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/566383 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[11:51:34] <hauskatze>	 happy all-hands y'all
[11:52:43] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:53:16] <effie>	 ok so it was parsoid, I am checking if there is an open task 
[11:57:21] <icinga-wm>	 PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[11:59:49] <wikibugs>	 Operations, Performance-Team: Daily errors on webperf1002 & webperf2002 /usr/local/bin/arclamp-generate-svgs > /dev/null - https://phabricator.wikimedia.org/T243762 (jcrespo)
[12:00:59] <icinga-wm>	 RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2020-03-05 05:48:16 +0000 (expires in 37 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[12:07:13] <apergos>	 hello, what did I miss?
[12:10:20] <Amir1>	 !log ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --from-id 1860 --to-id 1860 (T243705)
[12:10:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:24] <stashbot>	 T243705: Lexemes: in English, "unknown language" appears instead of "English" in search, and "Q1860" appears instead of "English" on lexeme pages - https://phabricator.wikimedia.org/T243705
[12:44:48] <effie>	 apergos: not much 
[12:45:04] <effie>	 apergos: something is lightly misbehaving in parsoid
[12:45:30] <apergos>	 any idea what? I guess harder to know since it recovered
[12:47:36] <effie>	 apergos: yes, I am opening a task now
[12:47:49] <apergos>	 ok
[13:32:25] <icinga-wm>	 PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[13:35:19] <icinga-wm>	 RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2020-03-05 05:48:16 +0000 (expires in 37 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[13:36:40] <vgutierrez>	 !log restarting varnish-fe on cp4029 - T243634
[13:36:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:43] <stashbot>	 T243634: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634
[13:38:22] <wikibugs>	 (PS1) Marostegui: db1125: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/567448
[13:39:27] <wikibugs>	 (CR) Marostegui: [C: +2] db1125: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/567448 (owner: Marostegui)
[13:54:15] <vgutierrez>	 !log repooling cp4029 - T243634
[13:54:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:18] <stashbot>	 T243634: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634
[13:54:39] <vgutierrez>	 !log restarting varnish-fe on cp4030 - T243634
[13:54:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:46] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloud: introduce delegation for codfw1dev.wikimedia.cloud [dns] - https://gerrit.wikimedia.org/r/567453 (https://phabricator.wikimedia.org/T243556)
[13:56:03] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: wmcs: codfw1dev: fix .cloud domain [puppet] - https://gerrit.wikimedia.org/r/567454 (https://phabricator.wikimedia.org/T243556)
[13:58:25] <vgutierrez>	 !log repooling cp4030 - T243634
[13:58:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:05] <wikibugs>	 (CR) Zoranzoki21: "> Hey, this looks good to go. Were you going to schedule it for a" [mediawiki-config] - https://gerrit.wikimedia.org/r/565723 (https://phabricator.wikimedia.org/T243118) (owner: Zoranzoki21)
[14:44:36] <wikibugs>	 (PS4) Zoranzoki21: Add vzg-easydb.gbv.de to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/565723 (https://phabricator.wikimedia.org/T243118)
[14:44:41] <wikibugs>	 (PS5) Zoranzoki21: Add vzg-easydb.gbv.de to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/565723 (https://phabricator.wikimedia.org/T243118)
[14:45:23] <wikibugs>	 (PS6) Zoranzoki21: Add vzg-easydb.gbv.de to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/565723 (https://phabricator.wikimedia.org/T243118)
[14:46:51] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:48:49] <icinga-wm>	 PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[14:49:38] <effie>	 ^ checking 
[14:49:59] <Zoranzoki21>	 Hello, how to rebase patch in terminal?
[14:50:37] <icinga-wm>	 RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2020-03-05 05:48:16 +0000 (expires in 37 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[14:52:17] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:52:28] <Zoranzoki21>	 Because when I do it, bad things happen
[14:52:48] <Zoranzoki21>	 It wants to upload patch set on merged patch
[15:29:55] <icinga-wm>	 PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:42:33] <icinga-wm>	 RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:49] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 2 others: Wikidata queryservice lag repeatedly over 5s since Jan20, 2020 - https://phabricator.wikimedia.org/T243701 (Dvorapa) Today it is around 8-10s the whole day
[15:46:19] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloud: introduce delegation for codfw1dev.wikimedia.cloud [dns] - https://gerrit.wikimedia.org/r/567453 (https://phabricator.wikimedia.org/T243556)
[15:48:36] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 2 others: Wikidata queryservice lag repeatedly over 5s since Jan20, 2020 - https://phabricator.wikimedia.org/T243701 (Dvorapa)
[15:49:43] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 2 others: Wikidata queryservice lag repeatedly over 5s since Jan20, 2020 - https://phabricator.wikimedia.org/T243701 (Dvorapa)
[15:49:55] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloud: introduce delegation for codfw1dev.wikimedia.cloud [dns] - https://gerrit.wikimedia.org/r/567453 (https://phabricator.wikimedia.org/T243556) (owner: Arturo Borrero Gonzalez)
[15:53:30] <wikibugs>	 (PS13) Ammarpad: Add minerva custom log for la.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728)
[15:54:09] <wikibugs>	 (CR) Thcipriani: [C: +1] contint: use package_from_component, stop using docker class [puppet] - https://gerrit.wikimedia.org/r/566383 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[15:59:53] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: codfw1dev: fix .cloud domain [puppet] - https://gerrit.wikimedia.org/r/567454 (https://phabricator.wikimedia.org/T243556) (owner: Arturo Borrero Gonzalez)
[16:01:32] <wikibugs>	 (CR) Ammarpad: "Thank you. Please do schedule it" [mediawiki-config] - https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: Ammarpad)
[16:03:06] <wikibugs>	 Operations, MediaWiki-API, Traffic, Wikidata, and 2 others: wikidata.org handles GET MWAPI requests, but silently fails on POST - https://phabricator.wikimedia.org/T230051 (Addshore) Not sure if someone from #operations or #traffic might be able to enlighten us on how easy this would be to fix?
[16:04:58] <wikibugs>	 Operations, ops-eqiad, SRE-swift-storage: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T242511 (Cmjohnson) @godog I replaced the disk, please see what you need to do to add it back to the raid.  Thanks!
[16:08:27] <icinga-wm>	 RECOVERY - HP RAID on ms-be1039 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[16:19:15] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 2 others: Wikidata queryservice lag repeatedly over 5s since Jan20, 2020 - https://phabricator.wikimedia.org/T243701 (jcrespo) I am not part of the Wikidata QS team, so I don't have answers, just questions :-D Only chiming in because my team was been...
[16:30:28] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata queryservice lag repeatedly over 5s since Jan20, 2020 - https://phabricator.wikimedia.org/T243701 (Addshore)
[16:31:13] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Addshore)
[16:32:24] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Dvorapa) >>! In T243701#5834698, @jcrespo wrote: > You mention: >> Pywikibot test environments and CIs...
[16:38:38] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Addshore) >>! In T243701#5834698, @jcrespo wrote: >> Seconds or even a minute or two lag seems acceptab...
[16:40:40] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Dvorapa) (please keep in mind I don't understand the differences, all I understand is that API lag is o...
[16:42:04] <wikibugs>	 (PS1) Cmjohnson: updating dhcp file for mc-gp1001-1003 [puppet] - https://gerrit.wikimedia.org/r/567476 (https://phabricator.wikimedia.org/T241795)
[16:43:39] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (jcrespo) > API maxlag has been repeatedly declared to always be <5s by several people in the past  That...
[16:44:44] <wikibugs>	 (CR) Cmjohnson: [C: +2] updating dhcp file for mc-gp1001-1003 [puppet] - https://gerrit.wikimedia.org/r/567476 (https://phabricator.wikimedia.org/T241795) (owner: Cmjohnson)
[16:45:21] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Dvorapa) > What are these tests doing? They are obviously testing Pywikibot functions against several w...
[16:46:03] <wikibugs>	 Operations, ops-eqiad, serviceops, Patch-For-Review: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (Cmjohnson) The mac addresses were out of order...updated the file and attempting to run the installer
[16:46:18] <wikibugs>	 Operations, ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (Cmjohnson)
[16:46:35] <wikibugs>	 Operations, ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (Cmjohnson) the OS has been installed on 4 of these, initial puppet run has not been done
[16:52:27] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Dvorapa) > For the specific issue you are facing, I may be suggest to review SLA expectations about the...
[16:59:43] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Lucas_Werkmeister_WMDE) I wonder if it would make sense to ignore query service lag on GET requests? Th...
[17:00:03] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on ms-be1039 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1039&var-datasource=eqiad+prometheus/ops
[17:14:16] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: puppet: encapi: not declare python-yaml [puppet] - https://gerrit.wikimedia.org/r/567485
[17:15:39] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] openstack: puppet: encapi: not declare python-yaml [puppet] - https://gerrit.wikimedia.org/r/567485 (owner: Arturo Borrero Gonzalez)
[18:12:05] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:36:45] <icinga-wm>	 PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[18:37:31] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:38:33] <icinga-wm>	 RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2020-03-05 05:48:16 +0000 (expires in 37 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[19:15:14] <marostegui>	 !log Remove partitions from db2085 enwiki - T239453
[19:15:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:18] <stashbot>	 T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453
[19:16:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2085:3311 - T239453', diff saved to https://phabricator.wikimedia.org/P10277 and previous config saved to /var/cache/conftool/dbconfig/20200127-191614-marostegui.json
[19:16:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:27] <wikibugs>	 (PS2) ArielGlenn: write out and reuse pagerage info for big page content jobs [dumps] - https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434)
[19:21:54] <wikibugs>	 (CR) jerkins-bot: [V: -1] write out and reuse pagerage info for big page content jobs [dumps] - https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) (owner: ArielGlenn)
[19:22:36] <apergos>	 ffs flake8.  fine i'll add the exclusion to the top level tox.ini >_<
[19:23:48] <wikibugs>	 (PS3) ArielGlenn: write out and reuse pagerage info for big page content jobs [dumps] - https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434)
[19:24:08] <wikibugs>	 (CR) jerkins-bot: [V: -1] write out and reuse pagerage info for big page content jobs [dumps] - https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) (owner: ArielGlenn)
[19:25:26] <wikibugs>	 (PS4) ArielGlenn: write out and reuse pagerage info for big page content jobs [dumps] - https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434)
[21:19:35] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:48:19] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[21:53:31] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:54:01] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:54:41] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:55:21] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:55:21] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:55:27] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:56:27] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:57:12] <Hoffman>	 waow
[21:57:14] <Hoffman>	 what fixed it?
[21:58:55] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:59:29] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:00:27] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:00:43] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:01:13] <paladox>	 that's a huge increase in response times for the api
[22:01:28] <vgutierrez>	 !log restarting varnish-fe on cp4028
[22:01:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:01] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:03:49] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:04:21] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:05:01] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:05:35] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:07:23] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:07:58] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:09:53] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:12:55] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:13:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:14:09] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-
[22:14:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:14:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:14:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:14:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:14:45] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:15:21] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:15:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:15:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:15:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:15:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:15:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:15:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:15:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:15:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:16:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:16:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:16:07] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:17:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:17:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:17:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:17:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:17:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:17:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:17:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:17:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:19:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:19:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:20:03] <icinga-wm>	 PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[22:20:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:21:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:21:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:21:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:21:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:21:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:21:53] <icinga-wm>	 RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[22:22:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:22:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:22:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:22:53] <wikibugs>	 (PS1) Vgutierrez: varnish: Use ats-tls instead of nginx on varnish-frontend-restart [puppet] - https://gerrit.wikimedia.org/r/567514 (https://phabricator.wikimedia.org/T236120)
[22:23:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:23:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:23:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:23:27] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:23:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:23:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:24:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:25:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:26:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:27:57] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:29:27] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:30:11] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:30:25] <wikibugs>	 (PS1) Ema: vcl: stricter rate limiting for node-fetch [puppet] - https://gerrit.wikimedia.org/r/567515
[22:30:41] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[22:31:55] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:31:59] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:31:59] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:32:01] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:33:05] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:33:15] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] vcl: stricter rate limiting for node-fetch [puppet] - https://gerrit.wikimedia.org/r/567515 (owner: Ema)
[22:33:43] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:34:19] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[22:34:36] <wikibugs>	 (CR) Vgutierrez: [C: +2] vcl: stricter rate limiting for node-fetch [puppet] - https://gerrit.wikimedia.org/r/567515 (owner: Ema)
[22:35:45] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[22:39:39] <icinga-wm>	 PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[22:41:07] <icinga-wm>	 PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[22:41:20] <paladox>	 https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?orgId=1
[22:41:25] <paladox>	 shows no thread issues.
[22:42:29] <paladox>	 Any SRE around for ^? :)
[22:42:45] <Reedy>	 We're looking into it
[22:42:57] <paladox>	 thanks
[22:43:35] <icinga-wm>	 PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator
[22:44:32] <paladox>	 that's not good :(
[22:44:41] <icinga-wm>	 RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 864 bytes in 2.062 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[22:45:05] <icinga-wm>	 RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27384 bytes in 5.118 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[22:45:17] <icinga-wm>	 RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 36940 bytes in 0.129 second response time https://wikitech.wikimedia.org/wiki/Phabricator
[22:46:37] <apergos>	 pages...
[22:47:22] <apergos>	 got the phab critical and recovery together
[22:52:31] <icinga-wm>	 PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[22:53:59] <icinga-wm>	 PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[22:55:37] <icinga-wm>	 RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.178 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[22:56:08] <icinga-wm>	 RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27384 bytes in 6.753 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[22:58:05] <vgutierrez>	 !log restarting gerrit service
[22:58:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:21] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[23:01:41] <_joe_>	 !log restart apache on gerrit
[23:01:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:12] <wikibugs>	 (CR) Ema: [C: +1] varnish: Use ats-tls instead of nginx on varnish-frontend-restart [puppet] - https://gerrit.wikimedia.org/r/567514 (https://phabricator.wikimedia.org/T236120) (owner: Vgutierrez)
[23:02:52] <wikibugs>	 (CR) Vgutierrez: [C: +2] varnish: Use ats-tls instead of nginx on varnish-frontend-restart [puppet] - https://gerrit.wikimedia.org/r/567514 (https://phabricator.wikimedia.org/T236120) (owner: Vgutierrez)
[23:06:47] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[23:06:47] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[23:06:48] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[23:06:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:58] <wikibugs>	 Operations, Gerrit: gerritro user getting access denied from dbproxy1007 - https://phabricator.wikimedia.org/T243800 (Marostegui)
[23:10:17] <vgutierrez>	 !log rolling restart of varnish-frontend in cp4026 and cp4027
[23:10:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:37] <icinga-wm>	 RECOVERY - Disk space on ms-be1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1039&var-datasource=eqiad+prometheus/ops
[23:10:43] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:10:54] <wikibugs>	 Operations, Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (Marostegui)
[23:11:28] <wikibugs>	 Operations, Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (Marostegui)
[23:12:09] <wikibugs>	 Operations, Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (Paladox) I think this needs updating https://github.com/wikimedia/puppet/blob/production/modules/gerrit/manifests/jetty.pp#L44 (unless the pass is the same for the *ro user).
[23:13:25] <wikibugs>	 Operations, ops-eqiad, SRE-swift-storage: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T242511 (fgiunchedi) Open→Resolved >>! In T242511#5834630, @Cmjohnson wrote: > @godog I replaced the disk, please see what you need to do to add it back to the raid.  Thanks!  Thanks Chris...
[23:15:45] <wikibugs>	 (PS1) Ema: traffic-pool.service: replace nginx with ats-tls [puppet] - https://gerrit.wikimedia.org/r/567526 (https://phabricator.wikimedia.org/T231627)
[23:16:34] <wikibugs>	 Operations, Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (Marostegui) The password for the ro user is different from that one (on the pw repo, which is the one I used).  I tried the one on https://github.com/wikimedia/puppet/blob/production/modul...
[23:18:02] <wikibugs>	 (PS2) Ema: traffic-pool.service: replace nginx with ats-tls [puppet] - https://gerrit.wikimedia.org/r/567526 (https://phabricator.wikimedia.org/T231627)
[23:18:52] <wikibugs>	 Operations, LDAP-Access-Requests: Request for LDAP access to the WMF group for Sakti Pramudya - https://phabricator.wikimedia.org/T243802 (SpramudyaDev)
[23:18:54] <paladox>	 marostegui couldn't we add the pass here https://github.com/wikimedia/labs-private/tree/master/hieradata/hosts for gerrit1002?
[23:24:04] <wikibugs>	 Operations, Gerrit: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (Paladox) Couldn't we add the pass here https://github.com/wikimedia/labs-private/tree/master/hieradata/hosts for gerrit1002?
[23:32:50] <wikibugs>	 Operations, LDAP-Access-Requests: Request for LDAP access to the WMF group for Sakti Pramudya - https://phabricator.wikimedia.org/T243802 (SpramudyaDev)
[23:41:30] <wikibugs>	 Operations, Core Platform Team, MediaWiki-Parser, PoolCounter, serviceops: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (Joe)
[23:41:44] <wikibugs>	 Operations, Core Platform Team, MediaWiki-Parser, PoolCounter, serviceops: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (Joe) p:Triage→High
[23:46:50] <wikibugs>	 Operations, Core Platform Team, MediaWiki-Parser, serviceops, Wikimedia-Incident: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (Krinkle)
[23:55:57] <wikibugs>	 Operations, Traffic, Performance Issue, Wikimedia-Incident: Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage - https://phabricator.wikimedia.org/T243713 (Krinkle) >>! In T243713#5833717, @Michael wrote: > This seems to have been made worse...