[02:34:36] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[02:44:46] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[06:29:52] PROBLEM - ores on ores2004 is CRITICAL: connect to address 10.192.16.64 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:34:00] RECOVERY - Disk space on Hadoop worker on an-worker1103 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[06:55:22] RECOVERY - ores on ores2004 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201108T0800)
[08:04:04] (PS1) Giuseppe Lavagetto: Use a single "ssh-agent" systemd unit [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/639912
[08:04:06] (PS1) Giuseppe Lavagetto: Add a script to manage the ssh configuration [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/639913
[09:03:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:15:54] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:53:47] Operations, Traffic: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746 (BBlack) >>! In T266746#6588487, @BBlack wrote: > * for gdnsd-the-software: The gdnsd bits are addressed in a handful of commits here (not ye...
[14:56:09] Operations, Wikidata, Wikidata Query Builder, User-Addshore: Deploy WDQS query builder to microsites - https://phabricator.wikimedia.org/T266703 (Dzahn) @Addshore It's easiest if there are 2 separate deploy repos and both also do not live inside each other. 2 repos that git clone into 2 separate...
[16:47:28] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 151 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:49:10] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:01:30] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179 (Aklapper) Could #netops please answer T180179#4965646? Asking as tasks shouldn't remain stalled forever. Thanks in advance!
[17:03:12] Operations, Platform Engineering (Icebox), User-Eevans, User-fgiunchedi: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839 (Aklapper)
[17:04:19] Operations, Platform Engineering (Icebox), User-Eevans, User-fgiunchedi: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839 (Aklapper) Stalled→Open Question has been answered and this seems still wanted, hence resetting task status as task should not be [stalled](https://w...
[17:09:59] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179 (ayounsi) That's no questions for Netops but for WMCS.
[17:11:20] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[17:11:48] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[18:01:08] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[18:02:26] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[18:35:43] !log cdanis@cumin1001 START - Cookbook sre.network.cf
[18:35:45] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[18:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:47] (PS3) CDanis: depool esams [dns] - https://gerrit.wikimedia.org/r/627919
[19:16:38] (CR) CDanis: [C: +2] depool esams [dns] - https://gerrit.wikimedia.org/r/627919 (owner: CDanis)
[19:16:54] !log depool esams
[19:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:26] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 45.76 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[19:29:12] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[19:32:36] expected ofc...
[19:34:38] PROBLEM - MariaDB Replica Lag: pc1 on pc2010 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:48:06] !log cdanis@cumin1001 START - Cookbook sre.network.cf
[19:48:09] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[19:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:02] (PS3) David Caro: [apt::conf] Allow passing integers as value [puppet] - https://gerrit.wikimedia.org/r/639778
[19:53:10] (CR) David Caro: [apt::conf] Allow passing integers as value (3 comments) [puppet] - https://gerrit.wikimedia.org/r/639778 (owner: David Caro)
[19:53:26] (CR) jerkins-bot: [V: -1] [apt::conf] Allow passing integers as value [puppet] - https://gerrit.wikimedia.org/r/639778 (owner: David Caro)
[19:55:12] PROBLEM - MariaDB Replica Lag: pc1 on pc2010 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:58:48] (PS4) David Caro: [apt::conf] Allow passing integers as value [puppet] - https://gerrit.wikimedia.org/r/639778
[19:59:54] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:00:09] (CR) jerkins-bot: [V: -1] [apt::conf] Allow passing integers as value [puppet] - https://gerrit.wikimedia.org/r/639778 (owner: David Caro)
[20:11:45] (PS1) CDanis: temporarily route Italy to codfw [dns] - https://gerrit.wikimedia.org/r/639949 (https://phabricator.wikimedia.org/T262869)
[20:12:11] (PS2) CDanis: temporarily route Italy to codfw [dns] - https://gerrit.wikimedia.org/r/639949 (https://phabricator.wikimedia.org/T262869)
[20:13:25] (PS3) CDanis: temporarily route Italy to codfw [dns] - https://gerrit.wikimedia.org/r/639949 (https://phabricator.wikimedia.org/T262869)
[20:14:12] (CR) CDanis: [C: +2] temporarily route Italy to codfw [dns] - https://gerrit.wikimedia.org/r/639949 (https://phabricator.wikimedia.org/T262869) (owner: CDanis)
[20:14:40] (CR) Ayounsi: [C: +1] temporarily route Italy to codfw [dns] - https://gerrit.wikimedia.org/r/639949 (https://phabricator.wikimedia.org/T262869) (owner: CDanis)
[20:20:54] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[20:21:14] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[20:29:17] (PS1) CDanis: Revert "depool esams" [dns] - https://gerrit.wikimedia.org/r/639930
[20:29:25] (PS2) CDanis: Revert "depool esams" [dns] - https://gerrit.wikimedia.org/r/639930
[20:32:20] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[20:32:48] RECOVERY - MariaDB Replica Lag: pc1 on pc2010 is OK: OK slave_sql_lag Replication lag: 40.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:34:18] (CR) CDanis: [C: +2] Revert "depool esams" [dns] - https://gerrit.wikimedia.org/r/639930 (owner: CDanis)
[20:34:53] !log repool esams
[20:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:44] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 52.95 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:46:44] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 55.84 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:49:22] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[20:49:22] (Abandoned) Urbanecm: Enable DiscussionTools beta on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/627833 (https://phabricator.wikimedia.org/T262984) (owner: Esanders)
[21:15:44] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:17:30] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 79.38 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:18:22] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[22:29:42] (PS1) Tim Starling: Don't assume that warnings array will include 'code' key [core] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/639952
[22:29:45] (PS1) Tim Starling: Pass along ignorewarnings param to all individual chunks being uploaded [core] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/639953 (https://phabricator.wikimedia.org/T264333)
[22:30:38] (PS1) Tim Starling: Don't assume that warnings array will include 'code' key [core] (wmf/1.36.0-wmf.14) - https://gerrit.wikimedia.org/r/639954
[22:30:41] (PS1) Tim Starling: Pass along ignorewarnings param to all individual chunks being uploaded [core] (wmf/1.36.0-wmf.14) - https://gerrit.wikimedia.org/r/639955 (https://phabricator.wikimedia.org/T264333)
[22:31:57] (CR) Tim Starling: [C: +2] Don't assume that warnings array will include 'code' key [core] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/639952 (owner: Tim Starling)
[22:32:02] (CR) Tim Starling: [C: +2] Pass along ignorewarnings param to all individual chunks being uploaded [core] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/639953 (https://phabricator.wikimedia.org/T264333) (owner: Tim Starling)
[22:32:06] (CR) Tim Starling: [C: +2] Don't assume that warnings array will include 'code' key [core] (wmf/1.36.0-wmf.14) - https://gerrit.wikimedia.org/r/639954 (owner: Tim Starling)
[22:32:12] (CR) Tim Starling: [C: +2] Pass along ignorewarnings param to all individual chunks being uploaded [core] (wmf/1.36.0-wmf.14) - https://gerrit.wikimedia.org/r/639955 (https://phabricator.wikimedia.org/T264333) (owner: Tim Starling)
[22:33:41] (PS2) Urbanecm: Add wgNamespaceAliases for zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/637870 (https://phabricator.wikimedia.org/T266925) (owner: Hamish)
[22:54:47] (CR) jerkins-bot: [V: -1] Don't assume that warnings array will include 'code' key [core] (wmf/1.36.0-wmf.14) - https://gerrit.wikimedia.org/r/639954 (owner: Tim Starling)
[22:56:12] (Merged) jenkins-bot: Don't assume that warnings array will include 'code' key [core] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/639952 (owner: Tim Starling)
[22:56:26] (CR) jerkins-bot: [V: -1] Pass along ignorewarnings param to all individual chunks being uploaded [core] (wmf/1.36.0-wmf.14) - https://gerrit.wikimedia.org/r/639955 (https://phabricator.wikimedia.org/T264333) (owner: Tim Starling)
[22:59:28] (Merged) jenkins-bot: Pass along ignorewarnings param to all individual chunks being uploaded [core] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/639953 (https://phabricator.wikimedia.org/T264333) (owner: Tim Starling)
[22:59:39] (Merged) jenkins-bot: Don't assume that warnings array will include 'code' key [core] (wmf/1.36.0-wmf.14) - https://gerrit.wikimedia.org/r/639954 (owner: Tim Starling)
[23:06:20] !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.16/resources/src/mediawiki.Upload.js: fixing UBN T266903 (duration: 01m 35s)
[23:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:29] T266903: Recently more broken files (premature end of file at 5MB size) that were cross-wiki uploaded to Commons - https://phabricator.wikimedia.org/T266903
[23:08:02] !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.16/resources/src/mediawiki.api/upload.js: fixing UBN T266903 (duration: 01m 06s)
[23:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log