[00:00:23] SMalyshev: okay, so it seems nothing is deployed on beta cluster for 7 hours
[00:00:26] created T231058 for that
[00:00:27] T231058: No merged changes are deployed on betacluster - https://phabricator.wikimedia.org/T231058
[00:02:39] PROBLEM - HHVM rendering on mw2237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:39] PROBLEM - HHVM rendering on mw2205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:03:51] RECOVERY - HHVM rendering on mw2237 is OK: HTTP OK: HTTP/1.1 200 OK - 80937 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:03:51] RECOVERY - HHVM rendering on mw2205 is OK: HTTP OK: HTTP/1.1 200 OK - 80937 bytes in 0.308 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:07:30] Operations, ops-codfw: Degraded RAID on db2056 - https://phabricator.wikimedia.org/T231056 (wiki_willy) a:Papaul
[00:43:47] (CR) Herron: "Thx for the review! Please see inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) (owner: Herron)
[00:46:07] Operations, serviceops, PHP 7.2 support, PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (tstarling) Fun fact: the "gc address" of an object is packed into 14 bits, so increasing GC_ROOT_BUFFER_MAX_ENTRIES beyond 16383 causes...
[03:49:33] Operations, serviceops, PHP 7.2 support, PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (tstarling) PHP 7.3 rearranges the bitfield so that the address gets 20 bits. It's a binary compatibility break, and possibly a source co...
[05:00:21] Operations, ops-codfw: Degraded RAID on db2056 - https://phabricator.wikimedia.org/T231056 (Marostegui) Open→Declined This host is ready to be decommissioned {T230777} so let's not replace this disk
[05:00:48] Operations, ops-codfw, DC-Ops, decommission: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (Marostegui)
[05:03:43] Operations, DBA: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (Marostegui)
[05:07:26] (PS1) Marostegui: report_users: Add dbproxy1019's IP [software] - https://gerrit.wikimedia.org/r/531802
[05:07:50] (PS1) Marostegui: mariadb: Decommission db2066 [puppet] - https://gerrit.wikimedia.org/r/531803 (https://phabricator.wikimedia.org/T230885)
[05:08:16] (CR) Marostegui: [C: +2] report_users: Add dbproxy1019's IP [software] - https://gerrit.wikimedia.org/r/531802 (owner: Marostegui)
[05:08:40] Operations, DBA, Patch-For-Review: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (Marostegui)
[05:08:50] (Merged) jenkins-bot: report_users: Add dbproxy1019's IP [software] - https://gerrit.wikimedia.org/r/531802 (owner: Marostegui)
[05:08:51] !log Remove db2066 from tendril and zarcillo T230885
[05:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:59] T230885: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885
[05:09:02] (CR) Marostegui: [C: +2] mariadb: Decommission db2066 [puppet] - https://gerrit.wikimedia.org/r/531803 (https://phabricator.wikimedia.org/T230885) (owner: Marostegui)
[05:10:29] Operations, DBA, Patch-For-Review: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (Marostegui)
[05:11:21] !log Stop MySQL on db2066 for decommission T230885
[05:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:28] (CR) Marostegui: [C: +1] mariadb::dbstore_multiinstance - codfw: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531205 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[05:13:46] (CR) Marostegui: [C: +1] mariadb::sanitarium_multiinstance: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531201 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[05:14:11] (CR) Marostegui: [C: +1] mariadb::misc::phabricator - eqiad: add ipv6 address [puppet] - https://gerrit.wikimedia.org/r/531196 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[05:14:29] (CR) Marostegui: [C: +1] mariadb::backups - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531208 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[05:16:50] Operations, ops-codfw, DC-Ops, decommission: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (Marostegui) a:Marostegui→RobH
[05:17:04] Operations, ops-codfw, DC-Ops, decommission: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (Marostegui) This host is ready for #DC-Ops to decommission
[05:17:19] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[05:20:10] (CR) Muehlenhoff: [C: -1] "Sure, I'm out next week, let's discuss this when I'm back, marking as -1 for now." [puppet] - https://gerrit.wikimedia.org/r/531670 (owner: Muehlenhoff)
[05:22:54] (CR) Muehlenhoff: "Yeah, I think this warrants some further testing in a VM/Cloud VPS, the interaction between debconf, the package postinsts and krb5-config" [puppet] - https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[05:40:48] (PS1) Muehlenhoff: Restrict NTP servers to production networks (including frack) [puppet] - https://gerrit.wikimedia.org/r/531808
[05:52:26] !log installing squid3 security updates
[05:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:33] Operations, Readers-Web-Backlog, Traffic: [Bug] iPadOS 13 shows the desktop version of Safari with a broken layout - https://phabricator.wikimedia.org/T229875 (AndyRussG) >>! In T229875#5432239, @dr0ptp4kt wrote: > @ovasileva okay to redirect Safari desktop to mdot? > > CC @DStrine @mepps @ejegg @BB...
[06:31:09] Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (tramm) We have configured the name servers with Elkdata, I suppose the update can be finalized now.
[06:53:05] Operations, serviceops, PHP 7.2 support, PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (ssastry) >>! In T230861#5433119, @tstarling wrote: > PHP 7.3 rearranges the bitfield so that the address gets 20 bits. It's a binary com...
[06:55:53] (CR) Elukey: [C: +1] "Looks good on a cp-text node too:" [puppet] - https://gerrit.wikimedia.org/r/531730 (https://phabricator.wikimedia.org/T229357) (owner: Cwhite)
[06:58:28] (CR) Elukey: "> John, is there any way to not have to disable the linter for those" [puppet] - https://gerrit.wikimedia.org/r/531752 (owner: Ayounsi)
[06:59:19] (PS1) Ema: ATS: add icinga check for traffic_server restarts [puppet] - https://gerrit.wikimedia.org/r/531819
[07:05:17] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[07:06:55] (PS2) Ema: ATS: add icinga check for traffic_server restarts [puppet] - https://gerrit.wikimedia.org/r/531819 (https://phabricator.wikimedia.org/T227432)
[07:08:09] PROBLEM - LVS HTTP IPv4 #page on thumbor.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.0 503 Service Unavailable - 212 bytes in 10.080 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[07:08:22] <_joe_> uhm
[07:09:15] _joe_: looks like we've got a traffic increase in codfw
[07:09:22] https://grafana.wikimedia.org/d/0fj55kRZz/thumbor?panelId=42&fullscreen&orgId=1
[07:09:40] 429 increase even
[07:09:42] <_joe_> yeah I'm going to look into what that is
[07:09:42] RECOVERY - LVS HTTP IPv4 #page on thumbor.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 367 bytes in 10.083 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[07:09:44] <_joe_> yes
[07:09:48] <_joe_> poolcounter protecting us :P
[07:09:53] :)
[07:10:01] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[07:10:14] <_joe_> it will fail again
[07:10:26] (CR) Vgutierrez: [C: +1] ATS: add icinga check for traffic_server restarts [puppet] - https://gerrit.wikimedia.org/r/531819 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[07:12:07] <_joe_> someone is trying to create thumbnails of a large multi-page tiff
[07:12:34] those jerks! Trying to use our media. ;)
[07:12:45] <_joe_> well
[07:12:54] <_joe_> those are particularly pathological cases
[07:13:04] <_joe_> where we should probably not offer thumbnailing
[07:13:10] <_joe_> not on-request, at least
[07:13:14] (PS1) Vgutierrez: cache: Allow set an arbitrary port for incoming TLS connections [puppet] - https://gerrit.wikimedia.org/r/531824 (https://phabricator.wikimedia.org/T221594)
[07:13:24] *nod* there are some serious outliers in resolution and complexity
[07:13:38] <_joe_> like I remember someone trying to get a thumb of the 500th page of a 4k page pdf once :P
[07:14:00] (PS2) Vgutierrez: cache: Allow setting an arbitrary port for incoming TLS connections [puppet] - https://gerrit.wikimedia.org/r/531824 (https://phabricator.wikimedia.org/T221594)
[07:14:13] Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey) Summary of the current state and results achieved: * We added...
[07:14:15] <_joe_> ema: can you check the 5xx logs to see if there is any evidence of the origin?
[07:14:38] _joe_: yeah, see #-security
[07:16:15] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[07:16:37] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:24:03] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[07:30:15] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[07:31:11] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Logs
[07:32:20] (CR) Vgutierrez: "pcc shows almost a NOOP for text & upload: https://puppet-compiler.wmflabs.org/compiler1001/17995/" [puppet] - https://gerrit.wikimedia.org/r/531824 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[07:44:11] Operations, ops-eqiad, Cloud-VPS, decommission: Decommission iron - https://phabricator.wikimedia.org/T220505 (MoritzMuehlenhoff) >>! In T220505#5231561, @Krenair wrote: > 'install access for WMCS' struck me as odd so I asked around a bit: > ` > Iron has been used for cloudvirt installs...
[07:54:26] (PS1) Elukey: cdh::hive: re-allow localjoins in hiveserver2's jvm [puppet] - https://gerrit.wikimedia.org/r/531866 (https://phabricator.wikimedia.org/T209536)
[07:54:44] (PS1) Muehlenhoff: Decom iron [puppet] - https://gerrit.wikimedia.org/r/531867 (https://phabricator.wikimedia.org/T220505)
[07:55:24] (PS2) Jcrespo: mariadb-package: Upgrade to the latest 10.1/10.3 options [software] - https://gerrit.wikimedia.org/r/519071
[08:01:29] (CR) Filippo Giunchedi: [C: +1] profile, varnishkafka: remove logster cron entries from varnishkafka hosts [puppet] - https://gerrit.wikimedia.org/r/531730 (https://phabricator.wikimedia.org/T229357) (owner: Cwhite)
[08:07:14] (PS1) Vgutierrez: ATS: Allow specifying timeouts to TTFB in connections to origin servers [puppet] - https://gerrit.wikimedia.org/r/531872 (https://phabricator.wikimedia.org/T221594)
[08:07:39] Operations, Traffic: Allow blocking requests from specific networks on the edge - https://phabricator.wikimedia.org/T231063 (ema)
[08:07:48] Operations, Traffic: Allow blocking requests from specific networks on the edge - https://phabricator.wikimedia.org/T231063 (ema) p:Triage→Normal
[08:08:36] (CR) Filippo Giunchedi: [C: -1] "I'm ok either way re: partial rollout strategy FWIW" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) (owner: Herron)
[08:08:51] (CR) Filippo Giunchedi: [C: +1] ATS: add icinga check for traffic_server restarts [puppet] - https://gerrit.wikimedia.org/r/531819 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[08:11:39] (CR) Marostegui: [C: +1] mariadb-package: Upgrade to the latest 10.1/10.3 options [software] - https://gerrit.wikimedia.org/r/519071 (owner: Jcrespo)
[08:11:41] (PS1) Ema: VCL: add support for blacklisting IPs [puppet] - https://gerrit.wikimedia.org/r/531873 (https://phabricator.wikimedia.org/T231063)
[08:11:47] (PS1) Ema: Define secret varnish/blocked-nets.inc.vcl.erb [labs/private] - https://gerrit.wikimedia.org/r/531874 (https://phabricator.wikimedia.org/T231063)
[08:12:01] (PS1) Vgutierrez: ATS: Set origin TTFB timeout to 180 secs for TLS instance [puppet] - https://gerrit.wikimedia.org/r/531875 (https://phabricator.wikimedia.org/T221594)
[08:13:00] Operations: "Opsonly" bastion? - https://phabricator.wikimedia.org/T114992 (MoritzMuehlenhoff) Open→Resolved I'm closing this task in favour of T220505
[08:17:18] Operations, Traffic, Patch-For-Review: Allow blocking requests from specific networks on the edge - https://phabricator.wikimedia.org/T231063 (Joe) I think it's good to have a first, simple implementation, like the one above, but I think going further we would need a "block" object in puppet (or els...
[08:19:38] (PS2) Elukey: cdh::hive: re-allow localjoins in hiveserver2's jvm [puppet] - https://gerrit.wikimedia.org/r/531866 (https://phabricator.wikimedia.org/T209536)
[08:19:45] (CR) Giuseppe Lavagetto: [C: -1] "you need to render the template coming from secret." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/531873 (https://phabricator.wikimedia.org/T231063) (owner: Ema)
[08:20:04] <_joe_> ema: uhm looking at the other patch
[08:20:12] <_joe_> that's not a template :P
[08:20:34] <_joe_> so secret() is ok, but then don't name the file .erb
[08:20:50] (CR) Joal: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/531866 (https://phabricator.wikimedia.org/T209536) (owner: Elukey)
[08:22:07] _joe_: you're right!
[08:23:16] (CR) Jcrespo: [C: +2] mariadb-package: Upgrade to the latest 10.1/10.3 options [software] - https://gerrit.wikimedia.org/r/519071 (owner: Jcrespo)
[08:24:12] Operations, Traffic, Patch-For-Review: Allow blocking requests from specific networks on the edge - https://phabricator.wikimedia.org/T231063 (ema) >>! In T231063#5433306, @Joe wrote: > I think it's good to have a first, simple implementation, like the one above, but I think going further we would n...
[08:28:27] (PS2) Ema: VCL: add support for blacklisting IPs [puppet] - https://gerrit.wikimedia.org/r/531873 (https://phabricator.wikimedia.org/T231063)
[08:28:54] (PS2) Ema: Define secret varnish/blocked-nets.inc.vcl [labs/private] - https://gerrit.wikimedia.org/r/531874 (https://phabricator.wikimedia.org/T231063)
[08:30:19] _joe_: {{done}}
[08:32:00] (CR) Ema: [C: +2] ATS: add icinga check for traffic_server restarts [puppet] - https://gerrit.wikimedia.org/r/531819 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[08:33:16] (CR) Giuseppe Lavagetto: [C: +1] VCL: add support for blacklisting IPs [puppet] - https://gerrit.wikimedia.org/r/531873 (https://phabricator.wikimedia.org/T231063) (owner: Ema)
[08:34:00] (CR) Ema: [V: +2 C: +2] Define secret varnish/blocked-nets.inc.vcl [labs/private] - https://gerrit.wikimedia.org/r/531874 (https://phabricator.wikimedia.org/T231063) (owner: Ema)
[08:36:23] (CR) Ema: "pcc output here: https://puppet-compiler.wmflabs.org/compiler1002/18000/" [puppet] - https://gerrit.wikimedia.org/r/531873 (https://phabricator.wikimedia.org/T231063) (owner: Ema)
[08:53:19] (CR) Ema: [C: +1] cache: Allow setting an arbitrary port for incoming TLS connections [puppet] - https://gerrit.wikimedia.org/r/531824 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:54:34] PROBLEM - Apache HTTP on mw1314 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 2.495 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:55:44] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:58:30] (CR) Elukey: [C: +2] cdh::hive: re-allow localjoins in hiveserver2's jvm [puppet] - https://gerrit.wikimedia.org/r/531866 (https://phabricator.wikimedia.org/T209536) (owner: Elukey)
[08:58:38] (PS3) Elukey: cdh::hive: re-allow localjoins in hiveserver2's jvm [puppet] - https://gerrit.wikimedia.org/r/531866 (https://phabricator.wikimedia.org/T209536)
[09:02:53] Operations, serviceops, PHP 7.2 support, PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (tstarling) Tracing of TagTk::__destruct() shows that tokens are freed as it goes, they're not the problem. Disabling the GC causes the r...
[09:05:17] (CR) Daimona Eaytoy: "> Should be ready now" [mediawiki-config] - https://gerrit.wikimedia.org/r/480074 (owner: Daimona Eaytoy)
[09:23:02] (PS1) Vgutierrez: ATS: Provide websocket support [puppet] - https://gerrit.wikimedia.org/r/531885 (https://phabricator.wikimedia.org/T221594)
[09:24:18] (CR) jerkins-bot: [V: -1] ATS: Provide websocket support [puppet] - https://gerrit.wikimedia.org/r/531885 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[09:27:04] (PS2) Vgutierrez: ATS: Provide websocket support [puppet] - https://gerrit.wikimedia.org/r/531885 (https://phabricator.wikimedia.org/T221594)
[09:35:37] (PS2) Jbond: mariadb::dbstore_multiinstance - codfw: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531205 (https://phabricator.wikimedia.org/T102099)
[09:38:36] (CR) Jbond: "small issue due to my recent changes" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/531867 (https://phabricator.wikimedia.org/T220505) (owner: Muehlenhoff)
[09:38:44] (CR) Jbond: [C: +2] mariadb::dbstore_multiinstance - codfw: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531205 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[09:39:20] (CR) Muehlenhoff: Decom iron (1 comment) [puppet] - https://gerrit.wikimedia.org/r/531867 (https://phabricator.wikimedia.org/T220505) (owner: Muehlenhoff)
[09:41:02] (PS2) Muehlenhoff: Decom iron [puppet] - https://gerrit.wikimedia.org/r/531867 (https://phabricator.wikimedia.org/T220505)
[09:41:36] (CR) Jbond: [C: +1] Decom iron [puppet] - https://gerrit.wikimedia.org/r/531867 (https://phabricator.wikimedia.org/T220505) (owner: Muehlenhoff)
[09:43:27] (PS2) Jbond: mariadb::sanitarium_multiinstance: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531201 (https://phabricator.wikimedia.org/T102099)
[09:43:39] (PS1) Muehlenhoff: Remove explicity config for mapped ipv6 address from a few spare systems [puppet] - https://gerrit.wikimedia.org/r/531887
[09:44:03] (CR) Jbond: [C: +2] mariadb::sanitarium_multiinstance: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531201 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[09:44:12] (CR) jerkins-bot: [V: -1] Remove explicity config for mapped ipv6 address from a few spare systems [puppet] - https://gerrit.wikimedia.org/r/531887 (owner: Muehlenhoff)
[09:45:01] (PS2) Muehlenhoff: Remove explicity config for mapped ipv6 address from a few spare systems [puppet] - https://gerrit.wikimedia.org/r/531887
[09:45:03] (PS3) Jbond: Remove explicity config for mapped ipv6 address from a few spare systems [puppet] - https://gerrit.wikimedia.org/r/531887 (owner: Muehlenhoff)
[09:45:23] (PS4) Muehlenhoff: Remove explicit config for mapped ipv6 address from a few spare systems [puppet] - https://gerrit.wikimedia.org/r/531887
[09:45:44] (CR) Jbond: [C: +1] "thanks LGTM" [puppet] - https://gerrit.wikimedia.org/r/531887 (owner: Muehlenhoff)
[09:46:34] (CR) Jbond: [C: +2] Remove explicit config for mapped ipv6 address from a few spare systems [puppet] - https://gerrit.wikimedia.org/r/531887 (owner: Muehlenhoff)
[09:46:41] (PS5) Jbond: Remove explicit config for mapped ipv6 address from a few spare systems [puppet] - https://gerrit.wikimedia.org/r/531887 (owner: Muehlenhoff)
[09:48:21] (PS2) Jbond: mariadb::misc::phabricator - eqiad: add ipv6 address [puppet] - https://gerrit.wikimedia.org/r/531196 (https://phabricator.wikimedia.org/T102099)
[09:49:15] (CR) Jbond: [C: +2] mariadb::misc::phabricator - eqiad: add ipv6 address [puppet] - https://gerrit.wikimedia.org/r/531196 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[09:51:17] (PS2) Jbond: mariadb::backups - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531208 (https://phabricator.wikimedia.org/T102099)
[09:52:14] (CR) Jbond: [C: +2] mariadb::backups - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531208 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[09:55:25] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/531808 (owner: Muehlenhoff)
[10:06:02] !log elastic: reindexing wikis with old mappings in eqiad & codfw (T230990)
[10:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:27] T230990: create_timestamp not present on production index mappings for some wikis - https://phabricator.wikimedia.org/T230990
[10:12:59] (PS2) Muehlenhoff: Enable puppetdb1002/2002 as puppetdb hosts [puppet] - https://gerrit.wikimedia.org/r/531697
[10:14:25] (CR) Muehlenhoff: [C: +2] Enable puppetdb1002/2002 as puppetdb hosts [puppet] - https://gerrit.wikimedia.org/r/531697 (owner: Muehlenhoff)
[10:22:02] (PS1) Muehlenhoff: Add support for Buster in postgres manifests [puppet] - https://gerrit.wikimedia.org/r/531892
[10:23:51] (PS1) Jbond: wmf_style: add contain to this list of include like types [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/531893
[10:26:58] (CR) Jbond: "> Patch Set 3:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/531752 (owner: Ayounsi)
[10:28:10] (PS1) Ema: ATS: enable compress.so everywhere [puppet] - https://gerrit.wikimedia.org/r/531895 (https://phabricator.wikimedia.org/T227432)
[10:28:12] (PS1) Ema: cache: reimage cp1075 as text_ats [puppet] - https://gerrit.wikimedia.org/r/531896 (https://phabricator.wikimedia.org/T227432)
[10:29:04] (PS2) Ema: cache: reimage cp1075 as text_ats [puppet] - https://gerrit.wikimedia.org/r/531896 (https://phabricator.wikimedia.org/T227432)
[10:30:33] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/531892 (owner: Muehlenhoff)
[10:31:25] (PS1) Volans: sre.hosts.decommission: enhance capabilities [cookbooks] - https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066)
[10:33:16] (CR) jerkins-bot: [V: -1] sre.hosts.decommission: enhance capabilities [cookbooks] - https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066) (owner: Volans)
[10:34:10] (PS2) Volans: sre.hosts.decommission: enhance capabilities [cookbooks] - https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066)
[10:39:48] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/18002/" [puppet] - https://gerrit.wikimedia.org/r/531892 (owner: Muehlenhoff)
[10:39:55] (PS2) Muehlenhoff: Add support for Buster in postgres manifests [puppet] - https://gerrit.wikimedia.org/r/531892
[10:40:02] (CR) Marostegui: [C: +1] mariadb::dbstore_multiinstance - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531206 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:40:23] (CR) Marostegui: [C: +1] role::mariadb::misc - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531167 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:41:17] (CR) Marostegui: [C: +1] mariadb::core - codfw: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531161 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:41:24] (CR) Muehlenhoff: [C: +2] Add support for Buster in postgres manifests [puppet] - https://gerrit.wikimedia.org/r/531892 (owner: Muehlenhoff)
[10:42:03] (CR) Marostegui: [C: +1] mariadb::proxy - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531210 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:45:10] (PS1) Muehlenhoff: Revert "Enable puppetdb1002/2002 as puppetdb hosts" [puppet] - https://gerrit.wikimedia.org/r/531899
[10:46:39] (PS3) Volans: sre.hosts.decommission: enhance capabilities [cookbooks] - https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066)
[10:48:12] (CR) Jbond: [C: +1] Revert "Enable puppetdb1002/2002 as puppetdb hosts" [puppet] - https://gerrit.wikimedia.org/r/531899 (owner: Muehlenhoff)
[10:49:23] (CR) Muehlenhoff: [C: +2] Revert "Enable puppetdb1002/2002 as puppetdb hosts" [puppet] - https://gerrit.wikimedia.org/r/531899 (owner: Muehlenhoff)
[10:49:33] (PS4) Jbond: mariadb::dbstore_multiinstance - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531206 (https://phabricator.wikimedia.org/T102099)
[10:50:26] (CR) Jbond: [C: +2] mariadb::dbstore_multiinstance - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531206 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[10:54:08] (PS3) Jbond: role::mariadb::misc - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531167 (https://phabricator.wikimedia.org/T102099)
[10:56:42] (CR) Jbond: [C: +2] role::mariadb::misc - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531167 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[11:02:39] (PS3) Jbond: mariadb::core - codfw: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531161 (https://phabricator.wikimedia.org/T102099)
[11:03:32] (CR) Jbond: [C: +2] mariadb::core - codfw: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531161 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[11:13:11] !log Upgrade db1114 from 10.3.16 to 10.3.17
[11:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:57] Operations, Traffic, netops, IPv6, Patch-For-Review: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (jcrespo) Hi, I am bit disconnected about the planning of deployment of this- Once all hosts (or all hosts that are planned above being...
[11:17:32] (PS3) Jbond: mariadb::proxy - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531210 (https://phabricator.wikimedia.org/T102099)
[11:18:02] (PS1) Marostegui: control-mariadb-10.3: Upgrade version [software] - https://gerrit.wikimedia.org/r/531902
[11:18:09] (CR) Jbond: [C: +2] mariadb::proxy - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531210 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[11:18:49] (CR) Jcrespo: [C: -1] control-mariadb-10.3: Upgrade version (1 comment) [software] - https://gerrit.wikimedia.org/r/531902 (owner: Marostegui)
[11:23:00] (PS2) Marostegui: control-mariadb-10.3: Upgrade version [software] - https://gerrit.wikimedia.org/r/531902
[11:24:11] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[11:27:17] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[11:28:49] (CR) Jcrespo: [C: +1] control-mariadb-10.3: Upgrade version [software] - https://gerrit.wikimedia.org/r/531902 (owner: Marostegui)
[11:33:35] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=
[11:35:09] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[11:39:41] (CR) Muehlenhoff: sre.hosts.decommission: enhance capabilities (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066) (owner: Volans)
[11:39:51] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[11:42:57] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=
[11:47:37] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[11:49:06] (CR) Volans: "reply inline" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066) (owner: Volans)
[11:52:17] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=
[11:53:43] yep that's actually true, php7 latency spiked
[11:53:49] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[11:56:57] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[11:58:29] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[11:58:51] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-6h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200&var-code=204
[11:58:51] That does look odd
[12:02:10] can't find an obvious smoking gun in SAL
[12:02:51] all servers affected or just some, godog ?
[12:03:13] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[12:04:38] Krinkle: I'm not sure
[12:05:01] checking
[12:05:22] seems to affect POST as well
[12:05:23] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-6h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=POST&var-code=200&var-code=400
[12:07:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[12:08:39] yeah mw1270 seems to be the culprit
[12:10:25] and puppet is disabled, thoughts _joe_ effie ^ ?
[12:10:51] <_joe_> godog: ugh it happened again? [12:11:02] <_joe_> yes it was disabled because I was testing some mitigations [12:11:03] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:11:04] <_joe_> :/ [12:11:36] <_joe_> Krinkle: there is a task about the recursive slowness on the api just for php [12:11:47] <_joe_> I'm trying to get to the bottom of what causes it [12:12:12] recursive or you mean recurring? [12:12:41] <_joe_> recurring :P [12:13:06] honestly, I wouldn't be surprised if there is a recursive issue too, wouldn't be the first time with templates :-D [12:15:02] _joe_: heh the alert was for appserver not api, same issue? [12:15:03] <_joe_> !log depooling mw1270 temporarily, performance issues [12:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:11] <_joe_> it's different [12:15:21] <_joe_> the one on the API is weirder even [12:17:46] <_joe_> so it's definitely a different problem in the two cases [12:17:59] <_joe_> for the api I suspect it's a code pattern that's pathological on php7 [12:18:17] <_joe_> on the appservers... 
it seems just that the server gets in some form of deadlock [12:19:03] PROBLEM - PHP opcache health on mw1270 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:19:05] <_joe_> https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=mw1270&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&from=1566556558299&to=1566562667390 [12:19:09] <_joe_> is quite telling [12:19:17] <_joe_> uh that's interesting [12:23:45] RECOVERY - PHP opcache health on mw1270 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:24:01] <_joe_> !log pooling mw1270 temporarily, debugging performance issues [12:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:07] <_joe_> so, as happened before [12:25:17] <_joe_> repooling heals mw1270 [12:25:19] <_joe_> wtf [12:33:11] (03CR) 10Muehlenhoff: sre.hosts.decommission: enhance capabilities (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066) (owner: 10Volans) [12:37:49] (03PS8) 10Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/529053 (https://phabricator.wikimedia.org/T176875) [12:37:51] (03PS1) 10Mathew.onipe: elasticsearch: ship logs to syslog [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) [12:39:15] (03PS2) 10Mathew.onipe: elasticsearch: ship logs to syslog server [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) [12:57:25] (03PS3) 10Marostegui: control-mariadb-10.3*: Upgrade version [software] - 10https://gerrit.wikimedia.org/r/531902 [13:02:48] (03PS4) 10Marostegui: control-mariadb-10.3*: Upgrade version [software] - 10https://gerrit.wikimedia.org/r/531902 [13:04:01] (03PS5) 10Marostegui: control-mariadb-10.3*: 
Upgrade version [software] - 10https://gerrit.wikimedia.org/r/531902 [13:16:10] (03CR) 10Elukey: "From my limited understanding it looks good :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066) (owner: 10Volans) [13:31:00] (03CR) 10Volans: "> Patch Set 3:" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066) (owner: 10Volans) [13:35:27] 10Operations, 10serviceops, 10PHP 7.2 support: Mysterious, coordinated slowdowns every ~ 25 minutes on mw1347,mw1348 (php7 api servers) - https://phabricator.wikimedia.org/T231011 (10Joe) First smoking gun is in all the intervals I controlled the offender was parsoid-batch with quite large requests. I'm tryi... [13:39:31] (03CR) 10Jcrespo: [C: 03+1] control-mariadb-10.3*: Upgrade version [software] - 10https://gerrit.wikimedia.org/r/531902 (owner: 10Marostegui) [13:45:37] zeljkof: Re T231071, I can't reproduce it. From the trace it looks like it's coming from the Memcached (PECL) PHP extension trying to load a serialized object that is too big to unserialize. Possibly an HHVM/PHP7 issue, if one can save objects larger than the other can unserialize, but that's just a guess. Inability to reproduce may be due to the relevant edit having been deleted on enwikibooks. [13:45:37] T231071: /w/api.php... ErrorException from line 0 of : PHP Notice: Unable to unserialize ... Size of serialized string ... exceeds max - https://phabricator.wikimedia.org/T231071 [13:46:57] anomie: hm, good thing is that it's not happening a lot, but still 800+ hits in the logs in the last day [13:47:22] please leave a comment in phab and remove from train blockers if you think it's not blocking the train [13:51:26] zeljkof: Wait, I can reproduce. Whatever is going on logs an error but does not prevent the page output. Weird. 
[13:51:49] strange [13:55:13] (03PS2) 10BBlack: Un-submodule for nginx: move to prod env [1/2] [puppet] - 10https://gerrit.wikimedia.org/r/521323 (https://phabricator.wikimedia.org/T183454) [13:55:15] (03PS2) 10BBlack: Un-submodule for nginx: rename to orig path [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/521324 [13:55:23] zeljkof: do you have a logstash shortlink handy? [13:55:56] cdanis: for what? T231071? [13:55:56] T231071: /w/api.php... ErrorException from line 0 of : PHP Notice: Unable to unserialize ... Size of serialized string ... exceeds max - https://phabricator.wikimedia.org/T231071 [13:56:04] yeah :) [13:56:19] (03CR) 10jerkins-bot: [V: 04-1] Un-submodule for nginx: move to prod env [1/2] [puppet] - 10https://gerrit.wikimedia.org/r/521323 (https://phabricator.wikimedia.org/T183454) (owner: 10BBlack) [13:56:21] mostly wanted to see if it was correlated with just one PHP version [13:56:29] cdanis: sure, I can create one, just a second [13:58:15] ah, I just found it, no worries [13:58:22] cdanis: https://logstash.wikimedia.org/goto/8d81481daa6366be3dac20be287fecca [13:58:39] ah, just saw your comment [13:58:46] in past 24h: Count by PHP version: 5.6.99-hhvm: 100% (948) [13:58:54] so anomie I think you're spot on about a PHP7-vs-HHVM issue [14:01:22] <_joe_> I would think it's some serialization difference though [14:01:30] <_joe_> now if we had the key in memcached in the logs [14:01:33] <_joe_> we could verify it [14:01:46] <_joe_> what key we were retrieving I mean [14:01:55] does seem like something that should be logged [14:04:50] zeljkof, cdanis: It looks like it's related to https://en.wikibooks.org/w/index.php?title=Wikijunior:Biology&diff=prev&oldid=3564564, which was an edit replacing the page with 479808 lines of the word "gay". Most of the hits seem to be from RSS feeds that would include that edit, but there are some attempting to view the actual diff too. 
[14:16:13] _joe_: Cache key looks to be "enwikibooks:diff:wikidiff2:1.12:old-3525109:rev-3564564:1.8.1". [14:17:50] <_joe_> anomie: thanks, I can try to report what's in there [14:20:51] 10Operations, 10ops-eqsin: Picture from commons not found from Singapure - https://phabricator.wikimedia.org/T231086 (10Fano) [14:22:22] 10Operations, 10ops-eqsin: Picture from commons not found from Singapure - https://phabricator.wikimedia.org/T231086 (10Der_Keks) [14:23:51] 10Operations, 10ops-eqsin, 10Traffic: Picture from commons not found from Singapure - https://phabricator.wikimedia.org/T231086 (10elukey) [14:29:24] 10Operations, 10Traffic, 10media-storage: Picture from commons not found from Singapure - https://phabricator.wikimedia.org/T231086 (10CDanis) Looks like we are indeed serving 404s for this object from codfw and all points west: `✔️ cdanis@cdanis ~ 🕥☕ for DC in esams eqiad codfw ulsfo eqsin ; do echo -ne "$... [14:30:56] nice cdanis --^ [14:35:18] anomie: just noticed this in logs :) `ConfigPaths.php: PHP Notice: Writing to /home/anomie/.config/psysh is not allowed.` [14:36:13] (03CR) 10Gehel: [C: 04-1] elasticsearch: ship logs to syslog server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [14:36:23] zeljkof: Probably from me running `mwscript maintenance/shell.php` [14:36:44] I think the server was mwmaint or something like that [14:42:33] 10Operations, 10Traffic, 10media-storage: Picture from commons not found from Singapure - https://phabricator.wikimedia.org/T231086 (10CDanis) Indeed, the object exists on eqiad, but never made it to codfw swift: `✔️ cdanis@ms-fe1005.eqiad.wmnet ~ 🕥☕ swift stat wikipedia-commons-local-public.8c '8/8c/Vier_F... 
[14:42:57] (03PS1) 10DCausse: [cloudelastic] Increase the heap up to 45G [puppet] - 10https://gerrit.wikimedia.org/r/531936 [14:44:40] zeljkof: mwscript sudos to www-data, but apparently $HOME is still my own home directory where www-data doesn't have permission to create the file. [14:47:56] * anomie tries creating ~/.config/psysh/psysh_history manually, but it probably didn't make a difference. [14:53:20] is there something wrong with logstash.wikimedia.org? [14:53:30] I can't get any dashboards to load :/ [14:55:46] hm, looks like it's just my chrome [14:55:50] works fine in firefox [14:57:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Cmjohnson) 05Open→03Resolved Finished the idrac setup. on-site work is complete [14:58:13] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (10Cmjohnson) The ticket was declined by Dell....stating that the disk we have installed are not original to the server. this requires me to investigate [15:01:58] 10Operations, 10Traffic, 10media-storage: Picture from commons not found from Singapure - https://phabricator.wikimedia.org/T231086 (10aaron) Isn't there a swiftrepl background process to fix this? [15:02:02] 10Operations, 10DBA, 10StructuredDiscussions, 10Growth-Team (Current Sprint), 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Tgr) Reviewing the discussion so far, these are the concrete steps, and some questions about th... [15:03:13] 10Operations, 10Traffic, 10media-storage: Picture from commons not found from Singapure - https://phabricator.wikimedia.org/T231086 (10CDanis) >>! In T231086#5433993, @aaron wrote: > Isn't there a swiftrepl background process to fix this? @fgiunchedi tells me that we turned off swiftrepl once the work in {T... 
[15:06:44] 10Operations, 10DBA, 10StructuredDiscussions, 10Growth-Team (Current Sprint), 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10jcrespo) We will have to do most of this anyway for T226704. [15:10:01] 10Operations, 10DBA, 10StructuredDiscussions, 10Growth-Team (Current Sprint), 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10jcrespo) > There are two clusters in the normal ES, with writes (if I understand correctly) ran... [15:10:03] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Jdforrester-WMF) [15:10:35] 10Operations, 10Traffic, 10media-storage: Picture from commons not found from Singapure - https://phabricator.wikimedia.org/T231086 (10fgiunchedi) >>! In T231086#5433994, @CDanis wrote: >>>! In T231086#5433993, @aaron wrote: >> Isn't there a swiftrepl background process to fix this? > > @fgiunchedi tells me... [15:15:09] 10Operations, 10Puppet, 10Documentation, 10Patch-For-Review, 10patch-welcome: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797 (10hashar) [15:16:23] PROBLEM - Host an-conf1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:19] RECOVERY - Host an-conf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:24:05] PROBLEM - Check the Netbox report-s- librenms for fail status. 
on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:25:50] (03PS3) 10Giuseppe Lavagetto: safe-service-restart: add the ability to just depool or repool a server [puppet] - 10https://gerrit.wikimedia.org/r/531509 [15:25:52] (03PS3) 10Giuseppe Lavagetto: conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/531510 [15:27:54] (03CR) 10jerkins-bot: [V: 04-1] safe-service-restart: add the ability to just depool or repool a server [puppet] - 10https://gerrit.wikimedia.org/r/531509 (owner: 10Giuseppe Lavagetto) [15:27:59] (03CR) 10jerkins-bot: [V: 04-1] conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/531510 (owner: 10Giuseppe Lavagetto) [15:31:12] 10Operations, 10ops-eqiad: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10Cmjohnson) I replaced the failed disk [15:31:56] (03PS4) 10Giuseppe Lavagetto: safe-service-restart: add the ability to just depool or repool a server [puppet] - 10https://gerrit.wikimedia.org/r/531509 [15:31:58] (03PS4) 10Giuseppe Lavagetto: conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/531510 [15:33:33] (03CR) 10jerkins-bot: [V: 04-1] conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - 10https://gerrit.wikimedia.org/r/531510 (owner: 10Giuseppe Lavagetto) [15:36:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] safe-service-restart: add the ability to just depool or repool a server [puppet] - 10https://gerrit.wikimedia.org/r/531509 (owner: 10Giuseppe Lavagetto) [15:37:06] (03CR) 10Gehel: [C: 03+2] [cloudelastic] Increase the heap up to 45G [puppet] - 10https://gerrit.wikimedia.org/r/531936 (owner: 10DCausse) [15:37:15] (03PS2) 10Gehel: [cloudelastic] Increase the heap up to 45G [puppet] - 10https://gerrit.wikimedia.org/r/531936 (owner: 10DCausse) [15:38:49] 10Operations, 
10ops-eqiad: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10Gehel) a:05Cmjohnson→03Gehel @Cmjohnson thanks! I'll take it over and reimage [15:44:01] 10Operations, 10Traffic, 10media-storage: Picture from commons not found from Singapure - https://phabricator.wikimedia.org/T231086 (10aaron) Still, a file was only uploaded, and no other operations done...I'm not sure why the DB would commit if the file store failed in one of the FileBackendMultiwrite backe... [15:45:59] 10Operations, 10Traffic, 10media-storage: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Ammarpad) [15:50:52] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10media-storage: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Krinkle) [15:51:48] 10Operations, 10Readers-Web-Backlog, 10Traffic: [Bug] iPadOS 13 shows the desktop version of Safari with a broken layout - https://phabricator.wikimedia.org/T229875 (10ovasileva) > In the longer term, we do need to think about responsive design for banners, and how CentralNotice and Advancement banners (like... [15:54:46] (03PS1) 10Ayounsi: Fastnetmon, add notification script [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) [15:55:01] Krinkle: you don't happen to have any bright ideas about where / how to look in logstash for errors related to T231086 do you? 
[15:55:02] T231086: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 [15:58:34] cdanis: Nope, other than mediawiki-errors [15:58:41] cdanis: https://logstash.wikimedia.org/goto/ebbe577f7e23200b818be2c35da58690 [15:58:47] Aug 23-25, for "swift" [15:58:50] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/18003/" [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [15:59:34] [{exception_id}] {exception_url} UploadChunkFileException from line 366 of /srv/mediawiki/php-1.34.0-wmf.15/includes/upload/UploadFromChunks.php: Error storing file in '/tmp/8X2t8D': backend-fail-internal; local-swift-codfw [15:59:44] the file in question was uploaded a month ago [16:00:01] oh, you wrote Aug here but you queried for July on logstash ;) [16:00:12] cdanis: Looks like XThwpwpAMDsAAAljTfsAAACN or XThwswpAMDoAAI8PtlYAAABY might align [16:00:12] (03PS4) 10Ayounsi: Pmacct, add source and destination countries based on GeoIP DB [puppet] - 10https://gerrit.wikimedia.org/r/531752 [16:03:21] https://logstash.wikimedia.org/goto/27bc8fe32d4441028fca86d41ae5f535 [16:03:32] this is happening often enough to be a concern IMO [16:04:42] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10media-storage: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Krinkle) The file was uploaded July 24, for which the logs will be deleted over the next 24 hours. Actual upload was logged at "14... [16:04:53] cdanis: posted some details on the task so that it won't expire tomorrow [16:05:17] Krinkle: logs expiring from where? 
[16:05:22] we have some in logstash from >30d ago [16:05:32] cdanis: oh, we keep 60 days now, right [16:05:59] See also T228292, and T38587 [16:06:00] T228292: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal - https://phabricator.wikimedia.org/T228292 [16:06:00] T38587: Fatal error when uploading a file to Commons (UploadStashFileNotFoundException) - https://phabricator.wikimedia.org/T38587 [16:06:20] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/18004/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531752 (owner: 10Ayounsi) [16:07:08] * Krinkle testing on mwdebug1002 for Abusefilter with Daimona [16:07:09] logstash is at 90d now btw [16:07:17] Daimona: applied for itwiki, test2wiki and enwiki [16:07:36] godog: thx, updated my comment [16:15:40] * Krinkle is done with testing on mwdebug1002 [16:25:44] 10Operations, 10cloud-services-team, 10netops: Review switches ACL to connect from tools-bastion to dbproxy1019 - https://phabricator.wikimedia.org/T230980 (10ayounsi) a:03ayounsi That's the change that needs to be pushed to cr1/2-eqiad: `lang=diff [edit firewall family inet filter labs-instance-in4 term la... [16:54:32] 10Operations: apply hostname label for wmf5176/gerrit1001 - https://phabricator.wikimedia.org/T231047 (10Jclark-ctr) 05Open→03Resolved host labeled with gerrit1001 [16:54:35] 10Operations: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10Jclark-ctr) [17:03:31] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10media-storage: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10CDanis) I did some digging in the swift logs around the time the file was uploaded; there's no record of swift in codfw ever recei... 
[17:09:55] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10media-storage: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10CDanis) BTW for posterity, here's how I looked for logs: on one of the syslog centralservers (e.g. wezen): `ls /srv/syslog/archiv... [17:25:43] (03PS7) 10Herron: prometheus: add prometheus ipsec exporter service & config in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) [17:26:17] (03CR) 10Herron: prometheus: add prometheus ipsec exporter service & config in ulsfo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [17:27:16] 10Operations, 10DBA, 10StructuredDiscussions, 10Growth-Team (Current Sprint), 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Tgr) >>! In T107610#5434011, @jcrespo wrote: > We will have to do most of this anyway for T2267... [17:36:52] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2002 & return server to spares pool - https://phabricator.wikimedia.org/T200210 (10RobH) [17:39:31] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:39:58] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [17:40:03] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [17:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:07] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2002 & return server to spares pool - https://phabricator.wikimedia.org/T200210 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `graphite2002.codfw.wmnet` - graphite2002.codfw.wmnet -... 
[17:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:33] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2002 & return server to spares pool - https://phabricator.wikimedia.org/T200210 (10RobH) [17:42:04] (03PS1) 10Strainu: [rowiki] Allow sysops to name patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531956 (https://phabricator.wikimedia.org/T231099) [17:43:39] (03PS1) 10RobH: decom use of graphite2002 hostname [puppet] - 10https://gerrit.wikimedia.org/r/531957 (https://phabricator.wikimedia.org/T200210) [17:45:20] (03PS1) 10RobH: decom graphite2002 dns use [dns] - 10https://gerrit.wikimedia.org/r/531958 (https://phabricator.wikimedia.org/T200210) [17:45:45] (03CR) 10RobH: [C: 03+2] decom use of graphite2002 hostname [puppet] - 10https://gerrit.wikimedia.org/r/531957 (https://phabricator.wikimedia.org/T200210) (owner: 10RobH) [17:46:02] (03CR) 10RobH: [C: 03+2] decom graphite2002 dns use [dns] - 10https://gerrit.wikimedia.org/r/531958 (https://phabricator.wikimedia.org/T200210) (owner: 10RobH) [17:47:30] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2002 & return server to spares pool - https://phabricator.wikimedia.org/T200210 (10RobH) [17:47:37] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2002 & return server to spares pool - https://phabricator.wikimedia.org/T200210 (10RobH) [17:48:45] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2002 & return server to spares pool - https://phabricator.wikimedia.org/T200210 (10RobH) a:05RobH→03Papaul @papaul, This is ready for disk wipe, hostname label removal, and then to be returned to the spares pool. Please note I'... [18:04:44] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Slaporte) Great. We are updating with the registrar now. 
[18:11:26] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:14:42] !log Dropped 2FA for User:DBrant (WMF), per request. [18:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:00] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10media-storage: upload LB: retry 404s cross-cluster - https://phabricator.wikimedia.org/T231108 (10CDanis) [18:24:15] (03CR) 10Cwhite: [C: 03+1] "Looks good!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [18:26:46] 10Operations, 10Commons, 10MediaWiki-File-management, 10media-storage: bring swiftrepl back to life - https://phabricator.wikimedia.org/T231110 (10CDanis) [18:50:42] (03PS1) 10CDanis: Re-enable logging of files needing to be synced, since this is now unexpected in a post-active/active world. [software] - 10https://gerrit.wikimedia.org/r/531964 (https://phabricator.wikimedia.org/T231110) [18:51:15] (03PS2) 10CDanis: swiftrepl: log on replications [software] - 10https://gerrit.wikimedia.org/r/531964 (https://phabricator.wikimedia.org/T231110) [18:59:17] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2001 - https://phabricator.wikimedia.org/T200209 (10RobH) I wanted to check with @wiki_willy if we need to reclaim this to spares, or if we can decommission and pull it out of the rack. The host was purchased on 2015-01-09, and will... [19:06:12] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2001 - https://phabricator.wikimedia.org/T200209 (10RobH) [19:09:30] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2001 - https://phabricator.wikimedia.org/T200209 (10RobH) [19:10:14] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:10:46] heh, im in the middle of decoms so thats not shocking [19:11:00] (netbox reports will show errors when im in the middle of the non-interrupt decom steps) [19:11:28] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [19:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:34] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [19:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:37] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2001 - https://phabricator.wikimedia.org/T200209 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `graphite2001.codfw.wmnet` - graphite2001.codfw.wmnet - Removed from Puppet master and... [19:13:27] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2001 - https://phabricator.wikimedia.org/T200209 (10RobH) [19:16:56] (03PS1) 10RobH: decom graphite2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/531965 (https://phabricator.wikimedia.org/T200209) [19:19:09] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160 - https://phabricator.wikimedia.org/T200209 (10RobH) [19:19:30] (03CR) 10RobH: [C: 03+2] decom graphite2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/531965 (https://phabricator.wikimedia.org/T200209) (owner: 10RobH) [19:20:32] (03CR) 10jerkins-bot: [V: 04-1] decom graphite2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/531965 (https://phabricator.wikimedia.org/T200209) (owner: 10RobH) [19:21:24] bleh statsd.codfw.wmnet points to graphite2001 [19:21:34] godog: you about? 
[19:22:38] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160 - https://phabricator.wikimedia.org/T200209 (10wiki_willy) @RobH - I'll leave it up to @Papaul, since he has a better idea on the chances of reusing the parts on this system. Thanks, Willy [19:32:30] (03CR) 10RobH: [C: 03+2] "So statsd.codfw.wmnet points to graphite2001.codfw.wmnet, so I'm not sure what to point this at." [dns] - 10https://gerrit.wikimedia.org/r/531965 (https://phabricator.wikimedia.org/T200209) (owner: 10RobH) [19:34:27] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160 - https://phabricator.wikimedia.org/T200209 (10RobH) Ok, I synced with @wiki_willy about this and the comment above. We're going to decommission this host, since it has 5 months of life left before... [19:35:22] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160 - https://phabricator.wikimedia.org/T200209 (10RobH) https://gerrit.wikimedia.org/r/c/operations/dns/+/531965 statsd.codfw.wmnet points to graphite2001.codfw.wmnet, so I'm not sure what to point... [19:36:39] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:38:59] (03PS1) 10RobH: graphite2001 decom [puppet] - 10https://gerrit.wikimedia.org/r/531968 (https://phabricator.wikimedia.org/T200209) [19:39:35] (03CR) 10RobH: [C: 03+2] graphite2001 decom [puppet] - 10https://gerrit.wikimedia.org/r/531968 (https://phabricator.wikimedia.org/T200209) (owner: 10RobH) [19:40:40] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160 - https://phabricator.wikimedia.org/T200209 (10RobH) [19:41:12] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160 - https://phabricator.wikimedia.org/T200209 (10RobH) a:05RobH→03Papaul [19:43:24] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10media-storage: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Der_Keks) A stupid question: Are there any techniques implemented to synchronize replications or to report faulty replications? Th... [19:44:11] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10media-storage: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10CDanis) @Der_Keks yes, that is the purpose of the aforementioned `swiftrepl` daemon. [19:45:15] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10media-storage: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Der_Keks) Ah okay and that's obviously disabled I understand. 
[19:49:59] (03PS8) 10Viztor: Update HD logo for wikisource using default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529175 [19:51:34] (03CR) 10jerkins-bot: [V: 04-1] Update HD logo for wikisource using default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529175 (owner: 10Viztor) [20:20:29] (03PS1) 10CDanis: dbctl: always validate vs JSON schema [software/conftool] - 10https://gerrit.wikimedia.org/r/531972 [20:20:31] (03PS1) 10CDanis: dbctl: initial support for hostsByName [software/conftool] - 10https://gerrit.wikimedia.org/r/531973 (https://phabricator.wikimedia.org/T229676) [20:22:58] (03CR) 10jerkins-bot: [V: 04-1] dbctl: always validate vs JSON schema [software/conftool] - 10https://gerrit.wikimedia.org/r/531972 (owner: 10CDanis) [20:23:42] (03CR) 10jerkins-bot: [V: 04-1] dbctl: initial support for hostsByName [software/conftool] - 10https://gerrit.wikimedia.org/r/531973 (https://phabricator.wikimedia.org/T229676) (owner: 10CDanis) [20:27:50] (03PS2) 10CDanis: dbctl: always validate vs JSON schema [software/conftool] - 10https://gerrit.wikimedia.org/r/531972 [20:27:52] (03PS2) 10CDanis: dbctl: initial support for hostsByName [software/conftool] - 10https://gerrit.wikimedia.org/r/531973 (https://phabricator.wikimedia.org/T229676) [20:40:27] Hi [20:40:51] How do I report a bug if Phab is OOS? 
*g* [20:41:01] https://phabricator.wikimedia.org/ [20:41:06] Request from 2401:4900:16d4:d32a:50ed:1c29:40d9:64ba via cp2010 cp2010, Varnish XID 473900475 [20:41:06] Error: 503, Backend fetch failed at Fri, 23 Aug 2019 20:39:25 GMT [20:41:52] I am trying to find a solution for https://commons.wikimedia.org/wiki/User_talk:Rillke/Discuss/2019#Chunk_upload_not_working [20:42:20] I am trying to upload a big PDF file from https://archive.org/details/20190823_20190823_0935 [20:42:43] 473 MB [20:42:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:42:49] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:42:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:43:51] I want to upload_by_url, as my connecton is limited [20:44:09] should be uploaded here https://commons.wikimedia.org/wiki/File:%E0%A4%B6%E0%A4%BF%E0%A4%B2%E0%A5%8D%E0%A4%AA%E0%A4%95%E0%A4%BE%E0%A4%B0_%E0%A4%9A%E0%A4%B0%E0%A4%BF%E0%A4%A4%E0%A5%8D%E0%A4%B0%E0%A4%95%E0%A5%8B%E0%A4%B6_%E0%A4%96%E0%A4%82%E0%A4%A1_%E0%A5%AD_-_%E0%A4%9A%E0%A4%BF%E0%A4%A4%E0%A5%8D%E0%A4%B0%E0%A4%AA%E0%A4%9F,_%E0%A4%B8%E0%A4%82%E0%A4%97%E0%A5%80%E0%A4%A4.pdf [20:44:17] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [20:44:31] I also tried to upload as a new file: idem [20:47:10] phabricator is up for me yannf [20:47:14] can you recheck? [20:47:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:48:05] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 10Core Platform Team Workboards (Clinic Duty Team): Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10Eevans) >>! In T224553#5431986, @Eevans wrote: >>>! In T224553#5431770, @MoritzMuehlenhoff wrote: >... [20:48:59] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [20:49:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:49:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:49:36] hauskatze, ok, working now https://phabricator.wikimedia.org/T231119 [20:49:51] Great :) [20:50:39] So yannf it looks like that big PDF might require a deployer to upload it directly to our databases [20:50:48] I'll take a look [20:50:55] PROBLEM - High CPU load on API appserver on mw2203 is CRITICAL: connect to address 10.192.32.91 port 5666: No route to host https://wikitech.wikimedia.org/wiki/Application_servers [20:51:41] hauskatze, I have uploaded files bigger than that without issue [20:51:51] thanks [20:52:27] Yeah, doesn't look like the file is one of those gigantic ones, but alas, no idea [20:52:31] RECOVERY - High CPU load on API appserver on mw2203 is OK: OK - load average: 0.27, 0.31, 0.28 https://wikitech.wikimedia.org/wiki/Application_servers [20:53:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:57:59] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10media-storage: upload LB: retry swift 404s cross-cluster - https://phabricator.wikimedia.org/T231108 (10CDanis) [21:01:43] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:03:32] cdanis: hi, is T231119 related to T231108?
[21:03:34] T231108: upload LB: retry swift 404s cross-cluster - https://phabricator.wikimedia.org/T231108 [21:03:34] T231119: Uploading a big PDF file failed - https://phabricator.wikimedia.org/T231119 [21:09:05] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Slaporte) 05Open→03Resolved The nameserver is updated and appears to be working! [21:17:51] hauskatze: very indirectly [21:18:21] it has to do with the corner case where an upload is reported as successful to the user, but actually failed to upload to one cluster [21:18:31] (see T231119's parent task for more info) [21:18:32] T231119: Uploading a big PDF file failed - https://phabricator.wikimedia.org/T231119 [21:22:07] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:22:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:23:35] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:23:43] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:23:53] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:25:41] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:31:37] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Quiddity) Thanks all, and happy 17th birthday to Estonian Wikipedia! [21:32:49] PROBLEM - High CPU load on API appserver on mw2244 is CRITICAL: connect to address 10.192.0.70 port 5666: No route to host https://wikitech.wikimedia.org/wiki/Application_servers [21:32:53] PROBLEM - Host search.svc.codfw.wmnet is DOWN: CRITICAL - Time to live exceeded (10.2.1.30) [21:32:57] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [21:33:11] bink [21:33:14] what's up? [21:33:16] "Time to live exceeded" lol? [21:33:19] RECOVERY - Host search.svc.codfw.wmnet is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [21:33:23] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:33:39] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:33:50] <_joe_> there is some problem connecting to codfw I'd say? [21:34:27] RECOVERY - High CPU load on API appserver on mw2244 is OK: OK - load average: 0.46, 0.33, 0.29 https://wikitech.wikimedia.org/wiki/Application_servers [21:34:41] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:36:17] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:36:25] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:40:21] PROBLEM - High CPU load on API appserver on mw2137 is CRITICAL: connect to address 10.192.16.110 port 5666: No route to host 
https://wikitech.wikimedia.org/wiki/Application_servers [21:40:59] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [21:41:49] (see T231119's parent task for more info) <-- hmm, no parent task for that Task, did you mean T231108? [21:41:49] T231108: upload LB: retry swift 404s cross-cluster - https://phabricator.wikimedia.org/T231108 [21:41:50] T231119: Uploading a big PDF file failed - https://phabricator.wikimedia.org/T231119 [21:41:57] RECOVERY - High CPU load on API appserver on mw2137 is OK: OK - load average: 0.12, 0.20, 0.24 https://wikitech.wikimedia.org/wiki/Application_servers [21:42:53] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:43:09] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:44:43] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:46:03] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:47:21] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:47:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:47:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:47:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:48:21] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [21:48:25] !log increase ospf cost of zayo codfw-eqiad link to 1320 (was 320) to make it secondary [21:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:59] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:49:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:49:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:49:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:49:41] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:50:53] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:51:07] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:51:15] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:52:11] 10Operations, 10Traffic, 10User-DannyS712: 503: backend fetch failed - https://phabricator.wikimedia.org/T231121 (10DannyS712) [21:52:43] (03PS1) 10Cwhite: hiera: change routing table to remove codfw [puppet] - 10https://gerrit.wikimedia.org/r/531976 [21:53:14] Hmm, I can't view the content from [21:53:14] https://en.m.wikipedia.org/wiki/List_of_television_awards on mobile [21:53:45] <_joe_> me neither paladox but we have other problems [21:53:57] Ah ok [21:54:01] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:54:15] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:54:20] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Let's wait 5 minutes tops then we should merge and apply." [puppet] - 10https://gerrit.wikimedia.org/r/531976 (owner: 10Cwhite) [21:55:19] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:55:27] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:55:27] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:55:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:56:39] (03PS1) 10Cwhite: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/531977 [21:59:22] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Change correct, but hopefully not needed :)" [dns] - 10https://gerrit.wikimedia.org/r/531977 (owner: 10Cwhite) [22:03:43] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [22:03:48] 10Operations, 10Commons, 10Internet-Archive: Uploading a big PDF file failed - https://phabricator.wikimedia.org/T231119 (10Urbanecm) Tried via mwdebug1001, so I can distinquish various attempts from each other. `name=mwlog1001:/srv/mw-log/apache2.log Aug 23 22:00:31 mwdebug1001: [proxy_fcgi:error] [pid 308... [22:06:39] 10Operations, 10Commons, 10Internet-Archive: Uploading a big PDF file failed - https://phabricator.wikimedia.org/T231119 (10Urbanecm) >>! In T231119#5434915, @Urbanecm wrote: > It may or may not be related to what @MarcoAurelio says, tagging #operations to judge on that. (althrough given I used `mwdebug1001... 
[22:13:57] (03Abandoned) 10Cwhite: hiera: change routing table to remove codfw [puppet] - 10https://gerrit.wikimedia.org/r/531976 (owner: 10Cwhite) [22:14:07] (03Abandoned) 10Cwhite: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/531977 (owner: 10Cwhite) [22:16:17] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:16:23] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:17:57] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:20:59] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:22:31] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [22:22:39] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:24:11] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:42:57] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:43:01] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:44:33] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:46:11] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:02:19] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2045.codfw.wmnet - https://phabricator.wikimedia.org/T228281 (10RobH) [23:03:28] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [23:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:34] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [23:03:38] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2045.codfw.wmnet - https://phabricator.wikimedia.org/T228281 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `db2045.codfw.wmnet` - db2045.codfw.wmnet - Removed from Puppet master and Puppet... 
[23:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:25] (03PS2) 10RobH: decom graphite2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/531965 (https://phabricator.wikimedia.org/T200209) [23:06:27] (03PS1) 10RobH: db2045 prod dns decom [dns] - 10https://gerrit.wikimedia.org/r/531979 (https://phabricator.wikimedia.org/T228281) [23:06:47] (03CR) 10jerkins-bot: [V: 04-1] decom graphite2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/531965 (https://phabricator.wikimedia.org/T200209) (owner: 10RobH) [23:06:52] (03CR) 10jerkins-bot: [V: 04-1] db2045 prod dns decom [dns] - 10https://gerrit.wikimedia.org/r/531979 (https://phabricator.wikimedia.org/T228281) (owner: 10RobH) [23:07:04] (03Abandoned) 10RobH: db2045 prod dns decom [dns] - 10https://gerrit.wikimedia.org/r/531979 (https://phabricator.wikimedia.org/T228281) (owner: 10RobH) [23:08:30] (03PS1) 10RobH: db2045 prod entry removal [dns] - 10https://gerrit.wikimedia.org/r/531980 (https://phabricator.wikimedia.org/T228281) [23:08:37] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:09:24] (03CR) 10RobH: [C: 03+2] db2045 prod entry removal [dns] - 10https://gerrit.wikimedia.org/r/531980 (https://phabricator.wikimedia.org/T228281) (owner: 10RobH) [23:09:27] (03PS1) 10RobH: db2045 decom [puppet] - 10https://gerrit.wikimedia.org/r/531981 (https://phabricator.wikimedia.org/T228281) [23:09:59] (03CR) 10RobH: [C: 03+2] db2045 decom [puppet] - 10https://gerrit.wikimedia.org/r/531981 (https://phabricator.wikimedia.org/T228281) (owner: 10RobH) [23:11:45] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:11:46] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2045.codfw.wmnet - https://phabricator.wikimedia.org/T228281 (10RobH) [23:13:03] 10Operations, 10ops-codfw, 10DC-Ops, 
10decommission: decommission db2045.codfw.wmnet - https://phabricator.wikimedia.org/T228281 (10RobH) a:05RobH→03Papaul [23:34:22] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [23:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:28] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [23:34:31] 10Operations, 10ops-codfw, 10decommission: Decommission db2034 - https://phabricator.wikimedia.org/T223216 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `db2034.codfw.wmnet` - db2034.codfw.wmnet - Removed from Puppet master and PuppetDB - Downtimed host on... [23:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:12] 10Operations, 10ops-codfw, 10decommission: Decommission db2034 - https://phabricator.wikimedia.org/T223216 (10RobH) I cannot locate the labeled switch port on the switch, so @papaul will need to trace and disable this via on-site work. 
[23:37:39] (03PS1) 10RobH: decom db2034 [puppet] - 10https://gerrit.wikimedia.org/r/531986 (https://phabricator.wikimedia.org/T223216) [23:38:22] (03PS1) 10RobH: db2034 prod dns decom [dns] - 10https://gerrit.wikimedia.org/r/531987 (https://phabricator.wikimedia.org/T223216) [23:39:08] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:39:09] (03CR) 10RobH: [C: 03+2] decom db2034 [puppet] - 10https://gerrit.wikimedia.org/r/531986 (https://phabricator.wikimedia.org/T223216) (owner: 10RobH) [23:39:10] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:39:30] (03CR) 10RobH: [C: 03+2] db2034 prod dns decom [dns] - 10https://gerrit.wikimedia.org/r/531987 (https://phabricator.wikimedia.org/T223216) (owner: 10RobH) [23:42:12] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:42:14] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:42:26] 10Operations, 10ops-codfw, 10decommission: Decommission db2034 - https://phabricator.wikimedia.org/T223216 (10RobH) [23:42:39] 10Operations, 10ops-codfw, 10decommission: Decommission db2034 - https://phabricator.wikimedia.org/T223216 (10RobH) a:05RobH→03Papaul [23:52:42] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:53:50] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status