[00:18:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:20:25] ottomata: .. and figured out. Happy reading :) – https://phabricator.wikimedia.org/T249261#6261156 [00:23:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:27:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:43:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 75 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:49:30] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 46 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:50:20] (03PS5) 10Ryan Kemper: maps: Rename "slave" to "replica" [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [01:11:14] (03CR) 10Ryan Kemper: "Something I've noticed: `hieradata/role/common/maps` doesn't exist in the "real" private repo, only the public one. 
I assume there's a rea" [labs/private] - 10https://gerrit.wikimedia.org/r/603975 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [01:14:16] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "Private repo change is done, so time to ship this public labs change" [labs/private] - 10https://gerrit.wikimedia.org/r/603975 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [01:18:43] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "Following merge of the corresponding labs changes, `pcc` is now happy here:" [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [01:21:52] (03CR) 10Ryan Kemper: "`puppet-merge` is done here. On Monday I'll circle back to remove the old slave files from labs public+private." [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [01:58:34] (03CR) 10Krinkle: [C: 03+1] "LGTM. Also quite non-destructive, easy to revert in the event anything does happen." 
[puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: 10Dave Pifke) [01:59:37] (03CR) 10Krinkle: [C: 03+1] "Assuming this is cherry-picked on beta, tag it "beta-cherry-picked", and may want to run puppet compiler + link as well prior to SRE landi" [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: 10Dave Pifke) [02:36:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:36:14] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice, 10Wikimedia-Incident: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10Krinkle) [02:36:55] 10Operations, 10Traffic, 10Patch-For-Review, 10Sustainability (Incident Prevention): monitoring & alerting for purged - https://phabricator.wikimedia.org/T256446 (10Krinkle) [02:38:13] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice, 10Wikimedia-Incident: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10Krinkle) I believe this is resolved now, right? Or Is there still user-facing impact in the form of served cache objects significantl... [02:41:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:58:20] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 107.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [03:49:14] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 68.14 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [05:03:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:05:16] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:47:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:48:54] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200627T0700) [07:10:26] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 72 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:16:14] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:20:38] 10Operations: Stay logged in doesn’t work, global login doesn’t work on different projects - https://phabricator.wikimedia.org/T256525 (10Ferdi2005) [08:30:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:32:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:37:54] 10Operations: Stay logged in doesn’t work, global login doesn’t work on different projects - https://phabricator.wikimedia.org/T256525 (10Wiki13) Part of this issue sounds like T252236, where I describe that newer browser versions severely break CentralAuth, resulting in no global login... [08:47:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:57:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:14:14] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:54:29] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice, 10Wikimedia-Incident: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10ema) >>! In T256444#6261251, @Krinkle wrote: > I believe this is resolved now, right? 
Or Is there still user-facing impact in the f... [09:56:56] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:02:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 45 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:03:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:08:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:10:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:21:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:28:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:37:48] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:27:08] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition=3 site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging [11:27:08] All&var-consumer_group=All [11:33:36] (03PS1) 10Hamish: Set $wgForceUIMsgAsContentMsg for Chinese Wikiquote, Wiktionary, Wikinews, Wikisource, Wikiversity and Wikibooks. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608124 (https://phabricator.wikimedia.org/T256521) [11:34:26] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [11:43:16] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10Krinkle) [11:54:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:00:12] (03PS1) 10QChris: Upload artifacts for Gerrit v3.2.2-97-gcaf5020db1 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/608127 [12:00:14] (03PS1) 10QChris: Drop hooks plugin [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/608128 [12:00:16] (03PS1) 10QChris: Drop webhooks plugin [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/608129 [12:00:19] (03PS1) 10QChris: Add gerrit1002 to targets [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/608130 [12:03:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:10:46] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={0,1} site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+promet [12:10:46] ter=logging-eqiad&var-topic=All&var-consumer_group=All [12:12:15] (03PS2) 10QChris: Switch to artifacts for Gerrit v3.2.2-97-gcaf5020db1 deployment [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/608127 [12:12:17] (03PS2) 10QChris: Drop hooks plugin [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/608128 [12:13:06] (03PS2) 10QChris: Drop webhooks plugin [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/608129 [12:13:07] (03PS2) 10QChris: Add gerrit1002 to targets [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/608130 [12:21:38] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [12:26:09] (03PS1) 10QChris: Migrate gerrit1001 to Gerrit v3.2.2-97-gcaf5020db1 [puppet] - 10https://gerrit.wikimedia.org/r/608136 [12:26:11] (03PS1) 10QChris: Migrate gerrit1001 to Gerrit v3.2.2-97-gcaf5020db1 [puppet] - 10https://gerrit.wikimedia.org/r/608137 [12:26:39] (03PS2) 10QChris: Migrate gerrit2001 to Gerrit v3.2.2-97-gcaf5020db1 [puppet] - 10https://gerrit.wikimedia.org/r/608137 [12:29:45] (03PS2) 10QChris: gerrit: As log4j.xml is a static file, treat it as static file [puppet] - 10https://gerrit.wikimedia.org/r/608097 [12:29:47] (03PS2) 10QChris: gerrit: Adapt log4j config to catch gc_log messages for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/608098 [12:29:49] (03PS2) 10QChris: Migrate gerrit1001 to Gerrit v3.2.2-97-gcaf5020db1 [puppet] - 10https://gerrit.wikimedia.org/r/608136 [12:29:51] (03PS3) 10QChris: Migrate gerrit2001 to Gerrit v3.2.2-97-gcaf5020db1 [puppet] - 10https://gerrit.wikimedia.org/r/608137 [12:42:56] (03CR) 10QChris: [V: 03+2 C: 03+2] "Self-merging to prepare for Gerrit v3.2.2-97-gcaf5020db1 deployment" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/608127 (owner: 10QChris) [12:43:21] (03CR) 10QChris: [V: 03+2 C: 03+2] "Self-merging to prepare for Gerrit v3.2.2-97-gcaf5020db1 deployment" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/608128 (owner: 10QChris) [12:43:35] (03CR) 10QChris: [V: 03+2 C: 03+2] "Self-merging to prepare for Gerrit v3.2.2-97-gcaf5020db1 deployment" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/608129 (owner: 10QChris) [12:43:49] (03CR) 10QChris: [V: 03+2 C: 03+2] "Self-merging to prepare for Gerrit v3.2.2-97-gcaf5020db1 deployment" [software/gerrit] 
(deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/608130 (owner: 10QChris) [13:03:50] !log qchris@deploy1001 Started deploy [gerrit/gerrit@460e439]: Gerrit to v3.2.2-97-gcaf5020db1 on gerrit1002 (gerrit-test) [13:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:58] !log qchris@deploy1001 Finished deploy [gerrit/gerrit@460e439]: Gerrit to v3.2.2-97-gcaf5020db1 on gerrit1002 (gerrit-test) (duration: 00m 08s) [13:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:19:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:26:50] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:39:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:44:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:53:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:04:46] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:33:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:35:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:00:16] (03PS1) 10Ladsgroup: piwik: Remove "slave" from comment [puppet] - 10https://gerrit.wikimedia.org/r/608157 (https://phabricator.wikimedia.org/T254646) [15:11:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:24:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:35:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:35:53] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to 3.3 - https://phabricator.wikimedia.org/T52864 (10Ladsgroup) [15:37:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:38:33] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to 3.3 - https://phabricator.wikimedia.org/T52864 (10Ladsgroup) After talking to @herron we decided that we start the upgrade (slowly) and hopefully we will get it deployed and upgrade in a couple of months (maybe a year).... [15:40:43] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Puppetize mailman3 - https://phabricator.wikimedia.org/T256536 (10Ladsgroup) [15:41:06] (03PS1) 10QChris: Bump gerrit.war to Gerrit v3.2.2-98-g98d827eaa3 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/608158 [15:41:29] 10Operations, 10Wikimedia-Mailing-lists: Puppetize mailman3 - https://phabricator.wikimedia.org/T256536 (10Ladsgroup) [15:42:53] (03CR) 10QChris: [V: 03+2 C: 03+2] "Self-merging to prepare for Gerrit v3.2.2 deployment" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/608158 (owner: 10QChris) [15:44:28] !log qchris@deploy1001 Started deploy [gerrit/gerrit@da40615]: Gerrit to v3.2.2-98-g98d827eaa3 on gerrit1002 (gerrit-test) [15:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:36] !log qchris@deploy1001 Finished deploy [gerrit/gerrit@da40615]: Gerrit to v3.2.2-98-g98d827eaa3 on gerrit1002 (gerrit-test) (duration: 00m 08s) [15:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:50:13] (03CR) 10Paladox: [C: 03+1] gerrit: As log4j.xml is a 
static file, treat it as static file [puppet] - 10https://gerrit.wikimedia.org/r/608097 (owner: 10QChris) [15:50:26] (03CR) 10Paladox: [C: 03+1] gerrit: Adapt log4j config to catch gc_log messages for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/608098 (owner: 10QChris) [15:53:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:06:51] 10Operations, 10DBA, 10Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Ladsgroup) [16:07:27] Just a heads up, that in about an hour, our Gerrit will get an upgrade and will hence be down for a bit. See https://lists.wikimedia.org/pipermail/wikitech-l/2020-June/093526.html [16:10:07] 10Operations, 10Wikimedia-Mailing-lists: Figure out a way to sync old and new mailman - https://phabricator.wikimedia.org/T256539 (10Ladsgroup) [16:10:10] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:17:36] 10Operations, 10Wikimedia-Mailing-lists: Fix the problem with gravatar and mailman3 - https://phabricator.wikimedia.org/T256541 (10Ladsgroup) [16:18:58] 10Operations, 10Wikimedia-Mailing-lists: Puppetize mailman3 web and hyperkitty (mailman archiver) - https://phabricator.wikimedia.org/T256542 (10Ladsgroup) [16:32:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:39:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:41:10] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:48:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:49:04] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 105.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [16:55:38] qchris: mutante: I am around ;] [16:55:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:56:07] hashar: Awesome! :-) [16:56:28] Hello everyone! Thank you for upgrading Gerrit! :) [16:57:26] spam of Invariant failed: Bad UTF-8 at end of string (2 byte sequence) bah [16:57:34] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:59:09] Hi Zoranzoki21 :-D [17:00:04] qchris and mutante: May I have your attention please! Gerrit v3.2 upgrade. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200627T1700) [17:00:29] Good luck [17:00:46] here [17:00:52] * thcipriani waves from sidelines [17:01:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:01:36] (03PS1) 10Ladsgroup: mailman3: Start mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/608163 (https://phabricator.wikimedia.org/T256536) [17:01:57] (03CR) 10jerkins-bot: [V: 04-1] mailman3: Start mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/608163 (https://phabricator.wikimedia.org/T256536) (owner: 10Ladsgroup) [17:01:58] * Zoranzoki21 wants good luck [17:02:38] * RhinosF1 gives Zoranzoki21 good luck as well even though he doesn't know what for [17:03:02] RhinosF1: For upgrading Gerrit, I know :) [17:03:06] No worry [17:03:08] After ~40 commits, its-phabricator is again in good shape. It now has [17:03:13] well done Qchris ;] [17:03:24] :-D [17:03:32] (03PS2) 10Ladsgroup: mailman3: Start mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/608163 (https://phabricator.wikimedia.org/T256536) [17:03:39] what do we do? [17:06:14] cheer appreciatively upon completion :) [17:06:17] hashar: we start by merging 2 logging changes [17:06:24] (03CR) 10Dzahn: [C: 03+2] gerrit: As log4j.xml is a static file, treat it as static file [puppet] - 10https://gerrit.wikimedia.org/r/608097 (owner: 10QChris) [17:06:26] Sounds like a plan [17:06:48] the last step will be rebooting hardware [17:07:00] !log Starting Gerrit upgrade to v3.2.2-98-g98d827eaa3 [17:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:25] /var/lib/gerrit2/review_site/etc/log4j.xml]/mode: mode changed '0444' to '0755' [17:08:40] Yup. That's expected as noted in Gerrit. 
[17:08:41] no change to the content [17:08:43] (on the change) [17:08:45] Cool. [17:08:52] on all 3 servers [17:08:54] next [17:09:15] mutante: I'm not authorized to schedule downtime for gerrit1001 and gerrit2001. Could you please schedule 5h of downtime for me? [17:09:17] (03CR) 10Dzahn: [C: 03+2] gerrit: Adapt log4j config to catch gc_log messages for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/608098 (owner: 10QChris) [17:09:40] ack, using the cumin cookbook for it [17:10:09] i'm also here :) [17:11:07] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:11:08] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:27] sudo cookbook sre.hosts.downtime -r version_upgrade -t T254158 -H 5 gerrit1001.wikimedia.org [17:11:28] T254158: Gerrit 3.2 upgrade - https://phabricator.wikimedia.org/T254158 [17:11:35] !log Disabling puppet on gerrit1001 for Gerrit upgrades + data migrations [17:11:35] Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: gerrit1001.wikimedia.org [17:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:42] heh [17:11:45] I'm around if I can be of help [17:11:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:11:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:11] also downtimed gerrit2001 (?) [17:12:22] Yup, both. [17:12:27] We need to upgrade both. [17:12:30] thanks paladox and reedy :) [17:12:32] qchris: done [17:12:50] Thanks. 
[17:13:20] https://gerrit.wikimedia.org/r/q/topic:%22separate-gerrit-3x-config%22+(status:open%20OR%20status:merged) [17:13:44] applying the gc_logging change [17:13:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:14:14] qchris: puppet is already disabled on 1001 but the last change was not applied yet [17:14:16] !log Duplicating reviewdb changes so we get a cheap and quick rollback [17:14:18] on the other 2 servers it is now [17:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:39] Argh. I thought you had it applied already. I'll have to do a manual run later anyways. [17:14:44] Thanks for the heads-up. [17:14:46] well, on 2001 and ok [17:16:31] ah, 2 more icinga checks had not been covered by that cumin command.. because they are attached to virtual host "gerrit.wikimedia.org" and not gerrit1001. downtime added manually [17:16:50] when Gerrit is down, there are a bunch of puppet manifests that will start failing all over the place. E.g. when using git::clone() {} [17:17:05] but I don't have a list handy so I guess we will have to hack them one by one :-\ [17:17:23] (they are 'Gerrit Health Check' and 'Gerrit JSON') [17:17:49] hashar: ack, that's going to be expected. it's all hosts using git::clone in puppet [17:18:02] we will run puppet on them and reschedule service check in icinga so it goes faster [17:18:19] could use the compiler to find some .. hmm [17:19:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:19:55] !log Stopping gerrit on gerrit1001 for the Gerrit upgrade [17:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:15] Now we'll soon find out what will break if gerrit is down :-/ [17:21:01] a couple analytics hosts.. puppet run on miscweb .. and .. [17:21:35] Anything worth stopping the upgrade? (Nothing happened yet, I'm still taking backups) [17:22:10] qchris: no, go ahead. it's only monitoring noise and i am ready to minimize it [17:22:16] Ok. Cool :-) [17:22:20] got icinga tab open [17:23:08] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:23:40] and there is only one change in CI so .. :] [17:23:58] PROBLEM - Check the last execution of git_pull_charts on deploy2001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:24:44] PROBLEM - Check systemd state on deploy2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:24:50] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:00] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:01] !log Disabled beta cluster update job (gerrit maintenance) https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/ [17:25:02] those are what we were talking about [17:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:38] PROBLEM - Check the last execution of git_pull_charts on contint2001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:25:44] ACKNOWLEDGEMENT - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:44] ACKNOWLEDGEMENT - Check the last execution of git_pull_charts on contint1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:25:44] ACKNOWLEDGEMENT - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:44] ACKNOWLEDGEMENT - Check the last execution of git_pull_charts on contint2001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:25:44] ACKNOWLEDGEMENT - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:45] ACKNOWLEDGEMENT - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:45] ACKNOWLEDGEMENT - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:46] ACKNOWLEDGEMENT - Check the last execution of git_pull_charts on deploy1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:26:40] PROBLEM - Check the last execution of git_pull_httpbb on deploy2001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_httpbb https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:26:54] ACKNOWLEDGEMENT - Check the last execution of git_pull_httpbb on cumin2001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_httpbb daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:26:54] ACKNOWLEDGEMENT - Check systemd state on deploy2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:54] ACKNOWLEDGEMENT - Check the last execution of git_pull_charts on deploy2001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:26:54] ACKNOWLEDGEMENT - Check the last execution of git_pull_httpbb on deploy2001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_httpbb daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:27:03] Backing up data is done. Last point of possible stop before the data munging starts. [17:27:09] mutante: Ok to continue? 
[17:28:10] PROBLEM - Check the last execution of git_pull_httpbb on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_httpbb https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:28:18] qchris: yes [17:28:24] Thanks. [17:28:52] I guess I do not need to log every step in the migration. I logged the start. And I'll log the end :-) [17:29:09] ACKNOWLEDGEMENT - Check the last execution of git_pull_httpbb on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_httpbb daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:29:34] I guess it is enough to chat here [17:29:47] if need be, we can just retrieve the history of actions from the IRC chat logs [17:29:49] unless you want it to document steps [17:30:58] Ok. Thanks. [17:35:40] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.0107 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:35:56] mutante: one more to hack [17:36:56] ACKNOWLEDGEMENT - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.0107 ge 0.01 daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:37:16] done. they will tell us again when they are back to OK but not keep repeating it until then [17:40:52] qchris: do you back up the git repositories before upgrading [17:40:53] ? [17:41:09] Yup. Database, git repos and lfs data [17:41:44] That has happened already. [17:42:03] Data migration to 2.16 just finished. 
[17:42:08] On to 3.0 :-) [17:43:08] nice!:) [17:43:09] Migration to 3.0 done [17:43:38] PROBLEM - Check the last execution of git_pull_httpbb on deploy1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_httpbb https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:43:49] Migration to 3.1 done [17:43:57] 3.2 is the part that takes longest ... [17:44:56] ACKNOWLEDGEMENT - Check the last execution of git_pull_httpbb on deploy1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_httpbb daniel_zahn gerrit upgrade https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:49:30] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:30] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:26] downtimed contint checks [17:50:31] on contint2001 that is due to the git_pull_charts [17:51:21] if they are flapping for some reason the ACK doesn't help, ACK [17:52:20] kind of surprised we have not had someone join the channel to ask if gerrit is down. Saturday. [17:52:53] :-) [17:53:36] Gerrit supports zero-downtime upgrade from Gerrit v3.1.6 (or later) ;] [17:53:38] I've been here for a while and am aware of the maintenance, but is Gerrit down? [17:53:54] Majavah: yes :) [17:54:00] :-D [17:54:09] but that needs a high availability setup ;) [17:54:11] hashar: Yes, but you need a multi-site setup. [17:54:11] hashar: cool! [17:54:16] Exactly. [17:54:23] mutante: wait your message requires me to join the channel first, brb [17:54:36] let's do that some time with the one in codfw [17:54:44] hello, is Gerrit down?
[17:54:46] lol [17:54:54] yup [17:55:46] not needing the mysql database anymore is going to be nice :) [17:55:52] ^ [17:55:53] and in the right direction [17:56:31] notedb still seems scary to me :) [17:56:32] we already have the ticket to tear down reviewdb with dbas [17:56:32] Read the NoteDB code and say that again :-) [17:56:45] oh, heh [17:56:58] I'd much rather have a proper DB (even MySQL/MariaDB) than NoteDB. [17:57:18] notedb is really neat, like as a though experiment :P [17:57:18] But meh. Here's the glorious future :-D [17:57:20] *thought [17:58:07] And without seeing messages like "Don't migrate to NoteDB on 2.15 or you'll run into issues at some later point, because migration code was broken on 2.15" [17:58:31] (in the official change logs that ^ is) [17:58:44] not terrifying at all [17:59:24] "don't upgrade in the past because in the future we broke that" [17:59:45] 😟 [18:00:18] Meh. What could possibly go wrong :-) [18:00:26] thanks for updating the IRC topic, btw [18:01:09] /home/qchris/run-logged.sh is neat ;) [18:01:30] though well one can use script -t logging.txt [18:01:31] ;] [18:01:42] Hahaha :-) A poor man's screen. [18:01:43] oh ! /me spies [18:01:57] so yeah we can just follow /home/qchris/upgrade.log [18:02:18] tail -f's :) nice [18:02:25] good to see your slices reindexing outputs progress [18:02:28] reindexing ongoing [18:02:46] * qchris feels whatched :-D [18:02:58] tail -F /home/qchris/upgrade.log |egrep -v '^Reindexed change' [18:03:07] those project-slices are your patch qchris isn't it ? [18:03:12] Yup. 
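The `tail -F /home/qchris/upgrade.log |egrep -v '^Reindexed change'` one-liner quoted above hides the per-change progress lines so that warnings stand out. A minimal Python sketch of the same filter follows, as an illustration only: the function name and the sample lines are made up here, and the real format of the reindexing output is not shown in the log.

```python
import re
from typing import Iterable, Iterator

# Same pattern as the egrep -v filter quoted in the chat.
NOISE = re.compile(r"^Reindexed change")


def drop_reindex_noise(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that do not start with 'Reindexed change'."""
    for line in lines:
        if not NOISE.match(line):
            yield line


# Hypothetical sample lines, not copied from the real upgrade.log.
sample = [
    "Reindexed change 1001",
    "WARN example: Cannot check change kind of new patch set",
    "Reindexed change 1002",
]
kept = list(drop_reindex_noise(sample))
```

Because the pattern is anchored with `^`, a line that merely mentions "Reindexed change" further along is kept, which matches what `egrep -v` does with the same pattern.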
[18:03:13] I mean the patch you wrote to split it [18:03:22] That helps a bit :-) [18:03:28] had two stacktrace emitted [18:03:37] [2020-06-27 18:03:10,935] [Index-Batch-12] WARN com.google.gerrit.server.change.ChangeKindCacheImpl : Cannot check change kind of new patch set 5ab9c26006ca4c1c17316a99707def15fc5fa6b6 in operations/software [18:03:38] java.util.concurrent.ExecutionException: org.eclipse.jgit.errors.MissingObjectException: Missing unknown f3c351f9c23337441a253d65188362c3f308d5d6 [18:03:39] There are a few in the process. [18:03:50] [2020-06-27 18:03:11,076] [Index-Batch-12] WARN com.google.gerrit.server.change.ChangeKindCacheImpl : Cannot check change kind of new patch set ced15d8e61e963fc9d08fe9a7bbd8656d58ac558 in operations/software [18:03:50] java.util.concurrent.ExecutionException: org.eclipse.jgit.errors.MissingObjectException: Missing unknown f3c683a6122536f01185f4ea68b749d1867a37fa [18:03:58] Changes that are broken in git and/or the database. [18:04:02] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:13] There's not much you can do about it. [18:04:37] maybe changes we have deleted [18:04:43] When testing the upgrade on gerrit-test, I took a whole lot of these Change Ids and tried them on the production gerrit. And the changes failed to load there too. [18:05:37] the ones of those I've run down are typically like comments on changes or patchsets in changes that are pointed at by the git dag, but missing. Typically for very old changes. [18:06:11] this is part of the problem with using git as a database. [18:06:18] Yup. [18:11:40] PROBLEM - Check the last execution of git_pull_charts on contint1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:12:02] Reindexing done :-) [18:12:22] Now for bringing it up manually... 
[18:12:26] -- RUNLOGGED stats ^^^ took 0 hours 26 minutes 45 seconds [18:12:30] that is hmm [18:12:31] fast [18:15:47] maybe we should think about granting access to cumin hosts and (selected) cookbooks for people who have shell on more than one machine. [18:16:35] "on all gerrits" is just 3 for now, but there are other clusters [18:17:48] though in this case there aren't that many commands that would really have to run on all, at least not at the same time [18:18:44] that would be super nice [18:19:05] I just use dsh locally or some home made shell scripts ;D [18:19:16] RECOVERY - Check systemd state on deploy2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:16] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:23] then as I understand it, cumin just assumes you get root everywhere [18:19:28] i could imagine doing some access groups that get certain (default) cookbooks but not all [18:19:30] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:32] mutante: Could you please merge the gerrit1001 -> is_new_version change? [18:19:37] qchris: ACK [18:19:42] RECOVERY - Check the last execution of git_pull_charts on contint2001 is OK: OK: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:19:56] Poor icinga-wm Soon you'll have to report failure of gerrit-relying jobs again :-( [18:20:12] (03CR) 10Dzahn: [C: 03+2] Migrate gerrit1001 to Gerrit v3.2.2-97-gcaf5020db1 [puppet] - 10https://gerrit.wikimedia.org/r/c/operations/puppet/+/608136 (owner: 10QChris) [18:20:21] Thanks. 
[18:20:28] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:20:29] mutante: good idea, put it in our post upgrade todos :) [18:20:29] If possible, let me do the puppet run. [18:20:36] mutante: ^ [18:20:42] RECOVERY - Check the last execution of git_pull_httpbb on deploy2001 is OK: OK: Status of the systemd unit git_pull_httpbb https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:20:48] * greg-g peeks in from the hardware store [18:20:58] qchris: doing the merge on the puppetmaster right now.. then letting you run the agent [18:21:04] Thanks! [18:21:09] puppetmaster sync running [18:21:32] and yea, i hit submit with 3.2 and looking different, yay [18:21:35] qchris: go ahead [18:21:44] Thanks! [18:22:02] greg-g: will do [18:22:12] RECOVERY - Check the last execution of git_pull_httpbb on cumin1001 is OK: OK: Status of the systemd unit git_pull_httpbb https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:22:28] RECOVERY - Check the last execution of git_pull_charts on contint1001 is OK: OK: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:23:08] how do I change back to the old UI? /sarcasm [18:23:29] :-D [18:24:02] i did not even get logged out! [18:24:04] For some reason i cannot view plugins, and the javamelody (https://gerrit.wikimedia.org/r/admin/plugins). But i can on gerrit-test. [18:24:42] paladox: plugin-sync still to come? [18:24:47] I got message in HTML format :D [18:25:15] oh [18:25:31] ah, yea, it's the maintenance message but in a pop-up now [18:26:08] paladox: patience.. it is restarting :) [18:26:33] Yes, I see. https://prnt.sc/t7l3st [18:26:45] And when I click on "refresh credentials" I get pop-up.. 
[18:27:19] !log qchris@deploy1001 Started deploy [gerrit/gerrit@da40615]: Gerrit to v3.2.2-98-g98d827eaa3 on gerrit1001 [18:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:25] Zoranzoki21: cause gerrit is down right now [18:27:27] !log qchris@deploy1001 Finished deploy [gerrit/gerrit@da40615]: Gerrit to v3.2.2-98-g98d827eaa3 on gerrit1001 (duration: 00m 08s) [18:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:35] Zoranzoki21: yea, i got that too. but just for a moment and I just hit dismiss. and that was it. it shouldn't have been raw HTML though [18:27:36] hashar: Yes, I'm tracking this.. [18:27:40] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:27:40] mutante: https://phabricator.wikimedia.org/T244840 for context [18:27:45] Zoranzoki21: so you get a 503 (server side error) when reaching out the API ;] [18:27:49] * kostajh waves and says a big thank you to everyone doing this work! [18:28:00] hashar: Thanks [18:28:01] volans: oh! thanks [18:28:14] subscribing [18:28:18] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:20] PROBLEM - Check systemd state on deploy2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:32] "Plugin install error: TypeError: self.onAction is not a function. 
(In 'self.onAction('project', 'delete', onDeleteProject)', 'self.onAction' is undefined) from https://gerrit.wikimedia.org/r/plugins/delete-project/static/delete-project.js" strange [18:28:32] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:45] that is definitely fix [18:28:47] Plugin install error: TypeError: self.onAction is not a function from https://gerrit.wikimedia.org/r/plugins/delete-project/static/delete-project.js [18:29:28] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:08] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:10] RECOVERY - Check systemd state on deploy2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:30:20] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:45] Meh I'll remove the delete-project plugin for now. I did not see this on gerrit-test ... [18:30:52] It says the version of the plugin is "v3.2.2" but this is definitely fix. [18:31:21] *fixed [18:31:33] Flushing caches and reloaded fixed the issue for me. [18:31:46] Not removing the delete-project plugin then :-) [18:32:36] oh [18:32:37] yeh [18:32:43] gerrit1001 should be operational from now on. 
[18:32:54] delete-project is not a big deal imho [18:33:05] Letting it sit for a bit while I upgrade gerrit2001 before declaring victory. [18:33:05] that is just for admins and we can suffer for it to be disabled for a bit I guess [18:33:53] * mutante adds token wikitech barnstar. thanks qchris! [18:34:05] Hahaha :-D Thanks. [18:34:24] FWIW, hard refresh fixed the "delete-project" message for me. Cached client js, I guess(?). [18:34:57] !log qchris@deploy1001 Started deploy [gerrit/gerrit@da40615]: Gerrit to v3.2.2-98-g98d827eaa3 on gerrit2001 [18:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:07] !log qchris@deploy1001 Finished deploy [gerrit/gerrit@da40615]: Gerrit to v3.2.2-98-g98d827eaa3 on gerrit2001 (duration: 00m 10s) [18:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:15] thcipriani yeh :/ [18:35:28] then I guess I will need to find how to bring colored CI messages again ;] [18:35:32] wfm :)) [18:35:55] at least CI receives events and manage to report back as intended [18:36:34] New gerrit is responsive [18:36:54] the "has draft comments" part shows up and brings me all this forgotten stuff [18:36:58] hehe [18:37:20] https://usercontent.irccloud-cdn.com/file/ODrtQaTp/IMG_6052.PNG [18:37:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:37:39] "CCed on" as a separate section is also nice [18:37:40] RECOVERY - Check the last execution of git_pull_httpbb on deploy1001 is OK: OK: Status of the systemd unit git_pull_httpbb https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:39:36] RECOVERY - Check the last execution of git_pull_charts on deploy2001 is OK: OK: Status of the 
systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:39:40] paladox: how to abandon a draft comment? [18:39:50] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [18:40:32] paladox: shall I file the bug I just found with upstream or WMF? [18:41:13] hrm, also have to figure out how to make pipelinebot messages pretty again (in addition to jenkins-bot messages) [18:41:26] paladox: nevermind, figured it out. [18:41:48] RhinosF1: upstream [18:41:58] depends what type of bug it is [18:42:10] mutante: not responsive properly [18:42:18] See the screenshot I posted [18:42:31] thcipriani: the pipeline report no longer shows the nice badge ( https://gerrit.wikimedia.org/r/c/blubber/+/608166 ) [18:42:32] I have same one, thanks RhinosF1 [18:42:44] thcipriani: but that is hardly a defect. Just some minor thing we need to tweak later on I guess [18:43:24] RhinosF1: you can start with WMF and we can figure it out from there. [18:43:34] mutante: ack [18:44:02] hashar: yeah, the repos where pipelinebot does comment, it's not formatted all fancy anymore, but that's not too bad.
[18:44:22] then that blubber change does not have a pipelinebot comment bah [18:44:28] I guess something changed in commentlink [18:44:34] yeah [18:44:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:45:22] mutante: https://phabricator.wikimedia.org/T256547 [18:45:34] abandoning comments i made on gerrit changes in 2014 that now show up as drafts and never noticed them before :) [18:45:52] hashar: I don't think blubber has pipelinebot comments except on merge (cf: https://gerrit.wikimedia.org/r/c/blubber/+/580167/ ) [18:46:07] ahh [18:46:12] RhinosF1: ack, i see. mobile view [18:46:28] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005035 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:48:17] ok .. so, looks like we now have the new gerrit! which is cool! but, i just now did a git-review of an updated PS for a patch in there, and it never got uploaded .. https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/607098 still shows the old version. I have git-review 1.26.0 [18:49:16] !log Enabling beta cluster update job (gerrit maintenance) https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/ [18:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:33] RhinosF1: i took the liberty to rename it a little, i know you mean responsive design but "not responsive" can also sound like "ping timeout" type of issue [18:49:58] subbu: You need at least git review 1.27 :-( [18:50:03] subbu: git review 1.27 is, I think, the version needed to use the refs/for/* push end point. Older versions use refs/publish which was removed.
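The version requirement discussed above (git-review 1.26.0 failing to upload, 1.27 named as the minimum) can be sketched as a small check. This is an illustration under assumptions: the 1.27 threshold is taken from the chat as stated, not verified against git-review's release notes, and the function names are hypothetical.

```python
import re

# Minimum git-review version as stated in the chat; treat it as an assumption.
MIN_VERSION = (1, 27)


def parse_version(text: str) -> tuple:
    """Extract a (major, minor) tuple from a string such as '1.26.0'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return (int(match.group(1)), int(match.group(2)))


def new_enough_for_refs_for(version_text: str) -> bool:
    """Return True if the version meets the minimum named in the chat."""
    return parse_version(version_text) >= MIN_VERSION
```

With this sketch, `new_enough_for_refs_for("1.26.0")` is false and `new_enough_for_refs_for("1.27.0")` is true. Comparing integer tuples rather than raw strings avoids ranking "1.9" above "1.27".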
[18:50:23] i see ... gotta figure out how to upgrade it ... thanks. [18:50:25] subbu: the last is fecfdd952ec65c9f5d502c1cb849f3eddf9859bf refs/changes/98/607098/5 [18:50:38] but yeah get the latest git-review for sure ;] [18:50:46] RhinosF1 which bug? [18:50:50] mutante: Could you please merge the gerrit2001 -> is_new_version change. [18:51:06] qchris: on it [18:51:19] subbu: probably just about: pip3 install --user git-review [18:51:27] mutante: Thanks! [18:51:34] (03CR) 10Dzahn: [C: 03+2] Migrate gerrit2001 to Gerrit v3.2.2-97-gcaf5020db1 [puppet] - 10https://gerrit.wikimedia.org/r/c/operations/puppet/+/608137 (owner: 10QChris) [18:51:36] `Gerrit.post('/accounts/self/drafts:delete', {query: '-is:open'})` in the JS console clears ooold comments :D [18:51:37] subbu: and you should then have an entry point at ~/.local/bin/git-review [18:51:54] " Plugin install error: TypeError: self.onAction is not a function from https://gerrit.wikimedia.org/r/plugins/delete-project/static/delete-project.js " [18:51:56] is annoying to [18:52:04] Reedy hard refresh [18:52:09] didn't somebody recently mention a minimum version for git-review? [18:52:09] I have [18:52:30] qchris: you can run puppet now [18:52:38] Thanks! [18:53:00] git-review 1.26 pushes to refs/publish/xxxxxx which got removed in Gerrit 3 in favor of the old refs/for/xxxxx [18:53:29] hmm [18:53:32] then git-review 1.28 has a bunch of nice additions such as finding the proper remote [18:53:37] so yeah folks should just upgrade [18:53:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:56:55] mutante: I think we're done with the main upgrade. Want to reboot the servers? [18:57:37] qchris: great! 
yes, i do [18:57:41] 2001 first [18:57:48] Reboot at will! :-) [18:57:56] looks like gerrit1001 is replicating bunch of stuff to gerrit2001 [18:58:03] (You can now upload files in the UI) [18:58:08] I assumes that is whatever new notedb entries [18:58:11] * mutante stops [18:58:19] Yes, it is. We ignore this replication. [18:58:24] I rsynced the repos over. [18:58:25] * mutante continues [18:58:28] So these are noops [18:58:31] Yup .Continue. [18:58:38] !log rebooting gerrit2001 [18:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:58] paladox: https://phabricator.wikimedia.org/T256547 [18:59:01] yeah, whenever we start gerrit on 1001 it runs a full replication of all repos to 2001 [18:59:16] but if they're all in sync, meh [18:59:16] mutante: ty [19:00:52] gerrit2001 is up again [19:00:53] qchris: gerrit2001 is back at login. let's check. [19:01:02] it showed one "Failed" for apache during boot [19:01:08] but then that fixes itself [19:01:16] checks [19:01:43] Gerrit itself looks good. [19:01:47] (on gerrit2001) [19:01:48] yea, apache is running [19:02:55] And cloning through http works. [19:03:09] So Internet -> apache -> gerrit works [19:03:17] On to gerrit1001? [19:03:17] so far I dont see any issue [19:03:29] hashar: \o/ [19:03:38] I wanna check the gerrit replica [19:03:47] icinga for gerrit2001 all green except 1 unknown left.. clearing it [19:04:10] That's an nrpe timeout, isn't it? [19:04:32] it is. on with 1001 [19:04:53] green now. moving on [19:05:04] Hahah. I cannot even schedule a re-check in our Icinga "Not Authorized" ... but at least it lets me see the pages. [19:05:17] Ok. On to gerrit1001 [19:05:26] qchris: maybe you can if you are Qchris instead of qchris, we can check another time [19:05:32] !log rebooting gerrit1001 [19:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:38] Oooooh! 
[19:05:54] yea, it lets you login with different capitalization [19:06:04] Is there a glitch in the last comment, or is it really empty? https://gerrit.wikimedia.org/r/c/mediawiki/services/change-propagation/+/580166 [19:06:04] but then icinga privileges must match exactly :p [19:06:24] it's us though because we slapped the login in front of it [19:07:06] PowerEdge R440 booting ... [19:07:08] PROBLEM - Host gerrit.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [19:07:17] Zoranzoki21: gerrit1001 is getting rebooted. Let's re-try in a few minutes. [19:07:22] that got solved by normalizing all usernames to lowercase [19:07:29] there is some magic java script we had to run for that [19:07:39] You hackers! [19:07:40] + some case sensitiveness to disable in the ldap config [19:07:41] (afaik) [19:07:51] I can't remember the exact details [19:07:52] qchris: gerrit1001 is back at login [19:07:54] ACKNOWLEDGEMENT - Host gerrit.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn upgrade [19:08:02] Thanks mutante [19:08:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:08:54] that alert wasn't supposed to happen, it did not page though. (maybe it should when gerrit.wm.org is down) [19:10:10] Mhmm. gerrit1001 does not let me ssh in. [19:10:45] qchris: i just got on it [19:11:07] try again while i watch the log [19:11:08] can't reach it either :-\ [19:11:15] Ok. Thanks. [19:11:30] passing through bast1002.wikimedia.org [19:11:42] I'm using bast1002 too.
[19:11:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:12:00] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:12:20] my config says to not use a bastion because it's public [19:12:30] looks at bast1002 [19:13:02] I can ssh to the Gerrit interface at gerrit.wikimedia.org [19:13:03] gerrit1001 has iptables rules to allow bast1002 and other bastions [19:13:08] I can get in, FWIW [19:13:09] I reach gerrit1002 just fine [19:13:43] I'm using bast..4002? [19:14:13] cant reach it via bast4002 [19:14:13] bah [19:14:42] When I ssh to gerrit1001.wikimedia.org through bast1002 the IP address 208.80.154.136 gets used. gerrit.wikimedia.org resolves to 208.80.154.137 (last octet is 136 vs 137) [19:14:47] well I can't reach bast4002.wikimedia.org [19:15:28] oh [19:15:57] .136 is the server IP and .137 is the service IP [19:16:04] gerrit1001 has them both on the interface [19:16:12] Ok. [19:16:34] but it also has 2620:0:861:3:208:80:154:87 wtf [19:17:30] Neither of the two IPv4 addresses has port 22 open to my ISP. [19:18:25] editing /etc/network/interfaces on gerrit1001 and removing 2620:0:861:3:208:80:154:87 which resolves to idp-test1001 [19:19:09] from bast1002 I can ssh -4 just fine [19:19:12] but not ssh -6 [19:19:14] !log rebooting gerrit1001 one more time [19:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:16] not sure whether that is any related [19:19:58] the IP that now resolves to idp-test1001 - was that a previous name of gerrit or service? 
[19:20:46] na idp that is the apero cas [19:20:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:20:54] http://idp.wikimedia.org/ [19:20:54] * qchris is glad that we waited with the rebooting of gerrit1001 until we have a maintenance window :-) [19:21:24] hashar: any better? [19:21:29] the ssh [19:21:48] session opened on bast1002 [19:22:02] qchris: oh man, so true [19:22:21] qchris: what happens if you use no bastion at all [19:22:54] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:23:01] I don't think we can [19:23:12] ssh should be firewalled for anything but from the bastion/cumin [19:23:24] Port 22 is not open to the world (on none of 208.80.154.136 208.80.154.137) [19:24:30] I'm in. [19:24:34] Through bast1002. [19:24:43] !log restarted ferm on gerrit1001 [19:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:16] ACCEPT tcp -- gerrit1001.wikimedia.org anywhere tcp dpt:ssh [19:25:59] qchris: can you ssh to the gerrit ssh on 29418? [19:26:22] yes [19:26:32] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:26:36] gerrit web interface is back for me [19:27:00] Now that it works ... should we try to reboot again to make sure your fixes (whatever you did) are permanent? 
[19:27:22] ok [19:27:52] !log rebooting gerrit1001 one more time [19:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:57] qchris: ok, try ssh [19:30:20] Gerrit ssh port works. [19:30:22] still broken for me [19:30:25] :-\ [19:30:43] Port 22 still hangs for me too. [19:31:16] are you doing ssh to gerrit.wm.org? [19:31:20] or to gerrit1001.wm.org [19:31:38] Port 22 hangs on both. [19:31:42] (on IPv4) [19:31:45] gerrit1001.wikimedia.org [19:31:55] with ssh -4 forced in the ProxyCommand and on the command line [19:31:58] try again, did anything change? [19:32:00] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:32:02] i restarted ferm again [19:32:43] No change for me. Port 22 still hangs on both IPs and through bast1002. [19:32:44] sshd is listening on all interfaces per netstat [19:33:08] FWIW, I can do ssh -4 or -6 and both work via bast4002 [19:33:33] web interface is also back again for me [19:33:48] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:33:52] some weird TTL / MTU issue maybe :-\ [19:33:56] Maybe a stupid question ... we do want this to work through bast1002, right? [19:34:02] es [19:34:04] hashar: It worked before for me. [19:34:07] * thcipriani tries switching bastions [19:34:11] (through bast1002) [19:34:20] Now I'm in. [19:34:23] Through bast1002 [19:34:25] ACCEPT tcp -- bast1002.wikimedia.org anywhere tcp dpt:ssh [19:34:54] so the one thing that i actually changed was removed that extra IP from /etc/network/interfaces [19:35:01] and then it might have just taken some time [19:35:37] That was ~150 seconds for me to log in. 
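The "ssh -4 works, ssh -6 hangs" comparison above can be sketched as a small per-address-family probe. This is only an illustration, not anything that was run during the maintenance window: the function name `port_reachable` and the default timeout are assumptions, and the host and port are whatever the caller passes in.

```python
import socket


def port_reachable(host, port, family, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds over the given
    address family (socket.AF_INET for IPv4, socket.AF_INET6 for IPv6)."""
    try:
        # Resolve only addresses of the requested family.
        candidates = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # No address of that family (or the name does not resolve at all).
        return False
    for fam, socktype, proto, _canonname, sockaddr in candidates:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            # Refused, timed out, or unreachable: try the next candidate.
            continue
    return False
```

Run from a bastion, calling it once with `socket.AF_INET` and once with `socket.AF_INET6` against the same host and port 22 would show whether only one address family is affected, which is what the `ssh -4` / `ssh -6` tests were checking by hand.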
[19:35:56] the output of "ip a s" on gerrit1001 looks right to me [19:36:05] it has gerrit.wikimedia.org and gerrit1001.wikimedia.org [19:36:07] v4 and v6 [19:36:22] netstat says sshd is listening. iptables says it is allowed from all bastions ... [19:36:27] it works for me.. [19:36:31] Ok :-) [19:36:43] still broken for me :-\ [19:36:45] I'm ok with having to wait 2.5 minutes to log in :D [19:36:48] hrm...yeah, bast1002 hangs connecting to gerrit. I get the host fingerprint for the bastion, now just sitting there. [19:37:14] logging into bast1002, I can nc gerrit1001 port 22 just fine, less than 1 sec to respond. [19:37:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:37:28] proxy command hangs though [19:37:29] I'm in again (Again after ~150 seconds) [19:38:02] and that passes through bast2002.wikimedia.org \o/ [19:38:18] oh, and I'm in via bast1002 finally [19:38:45] but it took a long time [19:39:08] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:40:04] well iptables drops stuff [19:40:25] i see a session opened for qchris on 1001 [19:40:51] I can get reliably in through bast1002 (3 times now in a row). But it takes ~150 seconds. [19:42:13] It seems now everyone is getting into gerrit1001 again (through different bastions though). Is that good enough to declare victory or do we want to debug further? (If so, just let me know what I should test) [19:43:10] I don't get what would have changed [19:43:15] qchris: one more time? still takes long? 
[19:43:20] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [19:43:28] * qchris tries [19:44:18] https://gerrit-review.googlesource.com/c/gerrit/+/273532 kind of fixes the issue RhinosF1 reported earlier [19:44:42] paladox: Nice! [19:45:28] interestingly, ping via ipv6 from bast1002 to gerrit1001 takes a loooong time whereas ipv4 is instant [19:46:21] mutante: Yes. Again ~150 seconds. [19:46:23] ooh.. i see something else [19:46:44] 2620:0:861:3:208:80:154:136 [19:46:46] 2620:0:861:2:208:80:154:136 [19:46:53] see those.. one is right and one is not [19:46:58] :-) [19:47:19] I don't understand anything of what is going on [19:47:26] Use IPv6 they said. Sooooo many new IP addresses they said. And now they are too many :-D [19:47:30] if I ssh -4 from bast1002.wikimedia that works fine [19:47:33] 19 up ip addr add 2620:0:861:2:208:80:154:136/64 dev eno1 [19:47:33] 20 up ip addr add 2620:0:861:3:208:80:154:136/64 dev eno1 [19:47:43] fixing /etc/network/interfaces some more [19:47:56] but doing the proxycommand from my host via bast1002 there is no packet received on the interface [19:48:04] (using v4 [19:49:01] !log removed 2620:0:861:3:208:80:154:136 from /etc/network/interfaces on gerrit1001, rebooting [19:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:24] cant you just restart the network ? 
[19:51:41] i also want to know for sure it is ok after next reboot [19:51:44] ;] [19:52:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=gerrit site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:52:10] RECOVERY - Host gerrit.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [19:52:19] now look at that [19:52:40] I am in [19:52:46] inet6 2620:0:861:2:208:80:154:136/64 scope global [19:52:53] it is ok after reboot and stays as it should [19:52:59] hashar: :) [19:53:27] I'm in after ~7 seconds. [19:53:31] so yea, in the past someone must have manually fixed/changed the IP on the interfaces [19:53:35] \o/ [19:53:41] without editing /etc/network/interfaces and rebooting [19:53:52] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:54:02] this would be a good example why rebooting can go wrong :p [19:54:05] I don't get why it was broken over v4 though [19:54:09] :) [19:54:31] anyway whatever happened I can ssh to gerrit1001 just fine now [19:54:44] * qchris repeats himself but .... meh : I'm so glad we waited for the maintenance window with the reboot :-D [19:54:51] ^ [19:54:59] very much :) [19:55:02] ;D [19:55:56] mutante: Do we need more reboots, or am I good to clean up Gerrit's replication queue/jobs/whatever? [19:56:04] (It's ok if we need more reboots) [19:56:29] (Better rebooting once too often than too little :-) ) [19:56:57] qchris: enough reboots on gerrit1001. but now let me check gerrit2001 some more about a similar issue [19:57:04] * qchris wants an assistant that can type my messages without typos/grammar errors/etc. [19:57:08] mutante: Ok. Cool. 
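For reference, the kind of check that would have flagged the stray address before the reboot can be sketched with Python's standard `ipaddress` module. This is a hedged illustration only: the helper name `addresses_outside_subnet` is made up, the two addresses are the ones quoted from /etc/network/interfaces above, and the expected subnet 2620:0:861:2::/64 is inferred from the address that remained on the interface after the fix.

```python
import ipaddress


def addresses_outside_subnet(addresses, expected_subnet):
    """Return the configured addresses (given as 'addr/prefixlen' strings)
    whose host address does not fall inside expected_subnet."""
    network = ipaddress.ip_network(expected_subnet)
    return [
        addr
        for addr in addresses
        if ipaddress.ip_interface(addr).ip not in network
    ]


# The two "up ip addr add" entries quoted from /etc/network/interfaces.
configured = [
    "2620:0:861:2:208:80:154:136/64",
    "2620:0:861:3:208:80:154:136/64",
]

# Assumption: the host's real subnet is 2620:0:861:2::/64.
stray = addresses_outside_subnet(configured, "2620:0:861:2::/64")
print(stray)  # ['2620:0:861:3:208:80:154:136/64']
```

The second entry is the one that was removed from the file before the final reboot.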
[19:58:07] nah, 2001 looks fine [19:58:11] qchris: go ahead [19:58:18] +1 [19:58:19] Thanks! [19:58:48] maybe we want paladox' change https://gerrit.wikimedia.org/r/c/operations/puppet/+/508657 [19:58:57] it fits with the other log4j stuff from earlier [19:59:19] The log4j json fix is live already. [19:59:25] (If you mean that one) [19:59:59] i mean the one where we remove additivity from gc_log [20:00:06] to avoid writing it twice [20:00:12] https://gerrit.wikimedia.org/r/c/operations/puppet/+/508657/6/modules/gerrit/templates/log4j.xml.erb [20:00:12] Oh. [20:00:33] Let the dust settle on Gerrit a bit. [20:00:40] ack [20:00:44] Let's watch how things go over the weekend and on Monday. [20:00:52] Then we merge the remaining things. Deal? [20:00:55] I'll let you figure out the log4j stuff. Last time I looked at it I ended up with an XML panic attack [20:01:16] yea, i did not want to add any other things on it today [20:01:32] the only reason I mentioned it was because it seemed related to the other log changes [20:02:18] qchris: gerrit down ? heh [20:02:28] Yup. For replication cleanup. [20:02:41] * mutante nods [20:02:51] Looking if I find more relics from the reboots/restarts [20:03:24] Gerrit up again. [20:04:02] confirmed. rescheduling remaining icinga alerts [20:04:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:05:45] everything gerrit-related is GREEN now: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=gerrit but still in downtime [20:05:49] the only issues I found so far are cosmetic ones. The commentlinks regexes have to be adjusted/updated. 
But that can be done later [20:06:05] the no more use "nofollow" [20:07:10] hashar prettifying the comments is not supported at least with the commentlink config. [20:07:12] hashar: this https://gerrit.wikimedia.org/r/c/operations/puppet/+/532391/5/modules/gerrit/templates/gerrit.config.erb ? [20:07:13] I have a test plan that I want to go through. But that'll take me about half an hour. Since we have had such a long downtime already and I don't see smoke anywhere... ok to send out an email declaring victory? [20:07:23] But i think there is a entrypoint to do this with a plugin [20:07:32] qchris: no veto [20:07:41] Thanks. [20:07:43] qchris: +1 [20:07:57] hashar, paladox You ok with it too? [20:08:00] paladox: ahh "cool" [20:08:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:08:09] qchris yup! looks good to me [20:08:38] nice work y'all. [20:09:05] hashar: You ok with declacing victory? [20:09:25] s/declacing victory/announcing that we're done/ [20:09:43] * qchris has to learn to cut down on using war terms. [20:10:01] I am french we never declare victory [20:10:08] :-) [20:10:09] we wait for our ennemies to accept their defeat ;D [20:10:11] lol [20:10:19] so yeah looks good to me [20:10:23] and Zuul looks to behave properly [20:10:24] * qchris is defeated. [20:10:24] \o/ [20:10:31] Awesome! [20:10:37] * qchris goes to write an email. [20:10:52] qchris: do ask folks to report issues at some place? ;) [20:11:17] Sure. Our Phabricator, I guess? [20:11:24] guess #gerrit in Phabricator is sufficient yes [20:12:21] so yeah +1 ;] [20:12:21] +1 [20:12:27] Cool. Will do. [20:12:58] paladox: is this upstream? 
https://phabricator.wikimedia.org/T256547 [20:13:06] yes [20:13:19] partial fix: https://gerrit-review.googlesource.com/c/gerrit/+/273532 [20:13:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:14:05] qchris: up to you if you want to link them to https://phabricator.wikimedia.org/T254158 or make new tickets. if you mention the upgrade ticket it can be nice or mean it keeps getting comments for minor design follow-ups [20:14:47] paladox: ahha good ;] [20:15:06] "Please test and if you run into issues, don't hesitate to file a ticket in Phabricator (Bonus points for adding the 'Gerrit' project and/or making T254158 a parent task)" [20:15:06] T254158: Gerrit 3.2 upgrade - https://phabricator.wikimedia.org/T254158 [20:15:11] Sounds ok? [20:15:11] could also be a column on https://phabricator.wikimedia.org/tag/gerrit/ [20:15:21] yes qchris [20:15:23] paladox: amazing :) [20:15:34] qchris: yea [20:15:39] Ok. [20:16:15] I would like to praise qchris for all the java patches, Paladox for all the various patches and the preliminary test setup and mutante for all the infrastructure assistance [20:16:17] that is great! [20:16:32] paladox: should i link that on the ticket? [20:16:33] Thanks to all for helping :-) [20:16:37] ^ [20:16:44] mutante yes you can [20:16:46] thanks all as well. great! [20:16:48] qchris: paladox mutante nice work! 
[20:17:22] hashar: nice job as well :) [20:18:34] qchris and I have exchanged phone number [20:19:46] T227562 [20:19:46] T227562: Update Gerrit documentation on user interface (for 3.1) on mediawiki.org - https://phabricator.wikimedia.org/T227562 [20:19:49] :) [20:20:02] my number is on office wiki table [20:22:17] i bet we can close some other tickets from the work board now. but it will happen soon enough :) [20:22:27] qchris: I guess you can now !log the upgrade :]] [20:22:36] Thanks! [20:22:42] !log Gerrit upgrade done. [20:22:44] congratulations! [20:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:53] Wooohoo! [20:23:54] congrats qchris [20:26:08] paladox: T227509 resolved i assume [20:26:08] T227509: Prepare Gerrit site template for upcoming Gerrit 3.x upgrade - https://phabricator.wikimedia.org/T227509 [20:26:14] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:27:34] T254648 [20:27:35] T254648: Double check that `Code of Conduct` and `Privacy` links are in place after site update for Gerrit 3.x - https://phabricator.wikimedia.org/T254648 [20:28:40] i see a CoC and Privacy link in the footer, is that it? [20:31:02] (03CR) 10Ladsgroup: meet: Add ferm rule to open port 5000 to the cloud proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/c/operations/puppet/+/604773 (https://phabricator.wikimedia.org/T251034) (owner: 10Ladsgroup) [20:31:39] mutante: Yes. 
[20:31:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:34:05] qchris: ack, resolved! [20:34:15] :-D [20:34:19] paladox: that was your change ^ [20:34:55] qchris: there is still the issue about delete project plugin being missing [20:35:18] https://gerrit.wikimedia.org/r/admin/plugins [20:35:25] hashar: I see the plugin there ^ [20:35:51] And here as well: [20:35:53] https://gerrit.wikimedia.org/r/admin/repos/test/gerrit-ping,commands [20:35:59] I get a blackbox showing: Plugin install error: TypeError: self.onAction is not a function from https://gerrit.wikimedia.org/r/plugins/delete-project/static/delete-project.js [20:36:10] Clear caches and reload the page. [20:36:21] shows for me [20:36:29] the delete project button [20:36:34] fyi that shows under ,commands [20:36:55] * hashar clears cache [20:37:00] damn [20:37:07] I don't even know how to clear my browser cache this days [20:37:24] Just wipe your disk and reinstall OS. [20:37:28] :-) [20:37:49] Shift + F5 [20:37:53] Or Ctrl + F5 [20:38:06] Something like that. (Depends on the browser you use) [20:38:30] yeah it is gone [20:38:45] guess the issue was cached somehow [20:38:48] thx ! :] [20:39:02] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:39:35] \o/ [20:40:12] Since that affected quite a few people ... should I send a reply with that to the "deployment done" message? 
[20:40:33] phab ticket mass editing in progress :) [20:40:42] qchris: can't hurt [20:40:43] mutante: hold it ;) [20:40:53] I would rather wait a bit before clearing out those tickets [20:40:56] just in case [20:41:45] i am commenting. not closing it all [20:42:14] some are obviously fixed, like "this link was a server error" and now it's not anymore [20:42:29] paladox also had all this stuff just waiting for upgrade:) [20:42:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:43:06] yeah that is great [20:43:13] i should remove the downtimes in icinga [20:43:24] once we declare it over [20:43:25] it is deleting a couple years of tech debt! [20:45:02] we have a single tag "upstream" in phabricator and a workboard for it https://phabricator.wikimedia.org/tag/upstream/ except it's about any upstream there could be, not a specific one [20:47:28] wikibugs is surprisingly silent, isnt it [20:47:58] The bot is active in #wikimedia-dev [20:48:12] Last message 10 minutes ago. [20:52:12] (03CR) 10DannyS712: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/c/operations/puppet/+/608157 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [20:52:52] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Dzahn) T254644 has some comments about removing this again now that Gerrit prod is on 3.2. We can also use this ticket to decom it and feel free to open it. 
[20:53:09] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Dzahn) 05Resolved→03Open [20:54:03] 10Operations, 10Continuous-Integration-Infrastructure, 10Gerrit: Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10Dzahn) [20:55:13] well [20:55:16] qchris: I am off :) [20:55:30] Thanks for all your help! [20:55:32] cdanis: could you change the topic back if you are still here? [20:55:35] Sleep well :-) [20:55:47] qchris: should i activate monitoring again for everything? [20:55:49] eveyrthing looks fine. I will still have my phone near by for the next hour or so [20:55:55] mutante: Yes, please. [20:55:58] thanks chris [20:56:00] qchris: doing [20:57:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:59:12] adding downtimes is quick, deleting them is not. have to select them one by one [21:01:48] Ouch :-(( [21:07:26] yea, the ACKs are nicer for one-time alerts because they disappear by themselves, but then you have spam if stuff is flapping [21:08:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:08:15] all done. gerrit1001 and gerrit2001 have active notifications again for all services. gerrit1002 disabled as it should [21:09:10] * qchris hugs mutante [21:09:12] Thanks! [21:09:36] Anything else to do around Gerrit until next week? [21:09:37] thanks qchris! 
i guess we are all done [21:09:41] \o/ [21:09:47] qchris: not that i know of, no :) [21:12:06] from now on alerts would be real again, so the window is over [21:15:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:20:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:26:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:33:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:53:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:57:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200627T2200) [22:08:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:17:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:26:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:44:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:52:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:57:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:01:26] O_o [23:02:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:15:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:19:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:22:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:32:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops