[02:44:02] PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:05:52] RECOVERY - PHP opcache health on mwdebug2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:35:38] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 111.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[04:59:22] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 35.59 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[06:08:44] DannyS712: that gerrit task may be T231827
[06:08:44] T231827: Data too long for column 'si_title' - https://phabricator.wikimedia.org/T231827
[06:09:02] i'm logged out of gerrit, again
[06:15:26] you submitted a different patch for that task
[06:16:16] Slightly different P user here
[06:17:03] oops, sorry, assumed it was Paladox
[06:17:05] :P
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200524T0700)
[07:41:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:46:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:48:26] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:47:06] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 108.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[11:03:24] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 118 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[11:59:16] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[12:02:52] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 63.05 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[12:55:17] (PS4) Zoranzoki21: Enable subpages in Page namespace on napwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755)
[13:10:18] (PS1) RhinosF1: Add autoreviewrestore into the rollbacker group in hiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/598245
[13:15:57] (PS2) RhinosF1: Add autoreviewrestore into the rollbacker group in hiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/598245 (https://phabricator.wikimedia.org/T252986)
[13:16:48] (CR) jerkins-bot: [V: -1] Add autoreviewrestore into the rollbacker group in hiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/598245 (https://phabricator.wikimedia.org/T252986) (owner: RhinosF1)
[13:22:48] (PS3) RhinosF1: Add autoreviewrestore into the rollbacker group in hiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/598245 (https://phabricator.wikimedia.org/T252986)
[13:59:44] (PS2) Alexandros Kosiaris: Remove install[12]002 site.pp entries [puppet] - https://gerrit.wikimedia.org/r/598071 (https://phabricator.wikimedia.org/T224576)
[13:59:46] (PS1) Alexandros Kosiaris: kubernetes2007-2014: Move to role kubernetes::worker [puppet] - https://gerrit.wikimedia.org/r/598254 (https://phabricator.wikimedia.org/T252185)
[14:01:04] (CR) Alexandros Kosiaris: [C: +2] Remove install[12]002 site.pp entries [puppet] - https://gerrit.wikimedia.org/r/598071 (https://phabricator.wikimedia.org/T224576) (owner: Alexandros Kosiaris)
[14:02:14] (CR) Alexandros Kosiaris: [C: +2] kubernetes2007-2014: Move to role kubernetes::worker [puppet] - https://gerrit.wikimedia.org/r/598254 (https://phabricator.wikimedia.org/T252185) (owner: Alexandros Kosiaris)
[14:15:57] Operations, ops-codfw, serviceops: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['kubernetes...
[15:16:17] Operations, ops-codfw, serviceops: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes2007.codfw.wmnet', 'kubernetes2011.codfw.wmnet', 'kub...
[16:32:16] PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[16:43:32] !log restart blazegraph on wdqs1007
[16:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:34] gehel: I was about to ask if it was the case, perfect timing :)
[16:44:56] !log depool wdqs1007 to catch up on lag
[16:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:00] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[16:45:19] elukey: yeah, looks like it was stuck for some reason (we have a known deadlock issue under investigation)
[16:45:42] elukey: thanks for checking!
[16:45:48] PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 1.15e+05 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:45:52] Daimona: so the XSS is not really an XSS?
[16:45:53] * gehel goes back to dinner
[16:46:01] hauskatze: Nope
[16:46:07] Just a false positive
[16:46:20] Daimona: okay, I'll 'declassify' the task then
[16:46:31] WFM
[16:47:56] done
[16:49:40] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[16:49:55] Operations, ops-eqiad, Analytics: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (Cmjohnson) I submitted a ticket with Dell for a replacement CPU. SR1025619583
[17:36:36] !log restarting elasticsearch psi on elastic1052
[17:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:20] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[18:02:32] Operations, Privacy Engineering, Research, Traffic, and 2 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (bmansurov) Just a little status update: I've removed YUI and am working on upgrading Bootstrap. Since a lo...
[18:31:45] Can someone approve my https://meta.wikimedia.org/wiki/Special:OAuthListConsumers/view/7a7e6f1134bfe9964faf509195858ab8 OAuth app request?
[18:32:43] done
[18:34:26] Reedy: Thank you very much for your prompt response.
[19:52:08] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 78 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:58:00] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 46 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:19:40] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:45:00] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:23:08] (PS1) Alexandros Kosiaris: kafka-dev: Drop redundant YAML doc starts [deployment-charts] - https://gerrit.wikimedia.org/r/598279
[21:23:10] (PS1) Alexandros Kosiaris: rake: Add kubeyaml validation [deployment-charts] - https://gerrit.wikimedia.org/r/598280
[22:08:40] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/598038 (owner: Elukey)
[22:25:21] (PS1) Brian Wolff: Include unzip package on extdist cloud vps tool for composer [puppet] - https://gerrit.wikimedia.org/r/598284 (https://phabricator.wikimedia.org/T215713)
[22:45:56] (CR) Krinkle: mtail: update varnishrls compatibility with rc35 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/594316 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[22:45:59] (CR) Krinkle: [C: +1] mtail: update varnishrls compatibility with rc35 [puppet] - https://gerrit.wikimedia.org/r/594316 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)