[00:00:47] paladox: /puppet$ grep -r rsyslog * | grep gerrit [00:00:54] there are 3 files [00:00:58] oh [00:01:08] gerrit-multiline, gerrit-apache2-error and gerrit-apache2-access [00:01:26] modules/profile/manifests/gerrit/server.pp [00:02:18] aha [00:02:22] it does *.log [00:02:25] different people added them... git blame [00:02:42] b8d0a9764c9 7ad2ab4b222 [00:03:11] *_log *error*.log and *access*.log [00:03:51] apache log isn't gerrit log though [00:03:58] it's just the "gerrit-apache" [00:04:16] yup [00:04:35] so yea, leave apache as it is but /var/log/gerrit/*_log is too broad [00:04:45] yup [00:04:53] is there a way to reject certain files? [00:06:28] (03PS1) 10Dzahn: gerrit: only ship gerrit.json to logstash, not *_log [puppet] - 10https://gerrit.wikimedia.org/r/509172 [00:06:32] hmm.. certain messages in the file yes.. certain files.. hmm [00:06:49] paladox: ^ so gerrit.json only ? [00:06:55] thanks [00:06:57] since that is our new file [00:07:03] i think we want the http logs in logstash? [00:07:06] maybe ssh too? [00:07:48] (03CR) 10Paladox: gerrit: only ship gerrit.json to logstash, not *_log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509172 (owner: 10Dzahn) [00:07:49] hmm.. yea [00:08:17] arrg.. the regex. yes [00:08:20] mutante what i think we can do is use a bit of puppet 4 magic (e.g. a for-each, supplying it with file names) [00:08:47] there is an ops session on java logging tomorrow :) [00:08:59] yup :) [00:09:20] ['json', 'json2'].each |$key| { $key } [00:09:20] looks at rsyslog::input::file class [00:10:47] paladox: could we define different names for sshd_log and httpd_log in gerrit config? [00:10:53] yes [00:11:04] in log4j we can name them whatever we want them to be :) [00:11:16] if we call them all "gerrit_something" we can do "gerrit*" and be done [00:11:27] yup [00:11:30] gerrit_* [00:11:52] (03PS1) 10Paladox: Gerrit: Rename some gerrit logs [puppet] - 10https://gerrit.wikimedia.org/r/509173 [00:13:32] we need to match everything that does NOT duplicate stuff after https://gerrit.wikimedia.org/r/c/operations/puppet/+/508657 [00:13:51] paladox: so first step is to rename error_log per " The "error_log" is not really an error log any more, it will contain all gerrit logs. " [00:14:11] and then we need to check if "all gerrit logs" really means all [00:14:16] what should we rename it to? [00:14:19] as in "files in /var/log/gerrit" [00:14:22] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@e13facb]: Downgrade LDF server back for T222471 [00:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:27] T222471: {"Accept": "application/ld+json"} causes 500 errors for LDF pages - https://phabricator.wikimedia.org/T222471 [00:14:30] if it does.. we just need to ship that one file [00:14:59] mutante couldn't we keep it as error_log for now but rename the other logs to gerrit_? [00:15:00] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@e13facb]: Downgrade LDF server back for T222471 (duration: 00m 37s) [00:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:22] paladox: well.. keep it simple /var/log/gerrit/gerrit.log is not used [00:15:31] but how much does it really contain [00:15:40] it contains everything [00:15:45] even replication_log ? [00:15:50] yes [00:15:56] anything from root is logged to it [00:16:09] ok. just call it gerrit.log and we rsync gerrit.log ..no wildcards at all.. done? [00:16:57] ok.
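[A minimal sketch of the "puppet 4 magic" paladox suggests above: iterating over file names to declare one rsyslog input per gerrit log instead of the over-broad /var/log/gerrit/*_log wildcard. The renamed file names and the exact rsyslog::input::file parameters shown here are illustrative assumptions, not the production manifest:

    # One rsyslog::input::file resource per (hypothetically renamed) gerrit log,
    # so nothing outside this explicit list gets shipped to logstash.
    ['gerrit', 'gerrit_sshd', 'gerrit_httpd'].each |String $log| {
        rsyslog::input::file { $log:
            path => "/var/log/gerrit/${log}.log",
        }
    }
]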
Though i thought we added gerrit.json so that that one is logged to logstash [00:17:01] since it's a json file. [00:17:07] then what do we have gerrit.json for though [00:17:19] for logstash [00:17:52] error_log is readable (whereas gerrit.json is really only readable with logstash). [00:17:59] it duplicates the content of error_log just in a different format? [00:18:07] yup. [00:19:00] mutante maybe we can name this gerrit.log.json [00:19:19] and then add a rsyslog for that one [00:19:26] hmm. ok. so then it is as in my original patch. only gerrit.json and nothing else [00:19:52] error_log doesn't contain httpd/sshd logs though [00:20:21] hrmm. just do multiple rsyslog::input::files then.. there are already 3.. who cares if it's 45 [00:20:24] 5 [00:28:45] (03PS2) 10Dzahn: gerrit: only ship gerrit.json to logstash, not *_log [puppet] - 10https://gerrit.wikimedia.org/r/509172 [00:42:21] (03PS3) 10Dzahn: gerrit: only ship gerrit.json and sshd_log to logstash, not *_log [puppet] - 10https://gerrit.wikimedia.org/r/509172 (https://phabricator.wikimedia.org/T141324) [00:45:04] paladox: ^ so .. that ? [00:45:28] and replication_log :) [00:45:39] though yes [00:45:40] heh, ok [00:45:51] do we need to keep "startmsg_regex => '^\\\\[[0-9,-\\\\+\\\\ \\\\:]+\\\\]'," for those logs? [00:46:04] not for the json [00:46:22] yup not for json [00:46:29] not sure about the sshd_log but yes.. looks like it [00:47:14] there are weird lines in the sshd_log that don't start with a timestamp [00:47:49] hmm [00:47:57] 190 [2019-05-10 00:09:49,377 +0000] d0a264c6 jenkins-bot a/75 gerrit.review.--project.mediawiki/extensions/Wikibase.--message.Gate pipeline build succeeded. [00:48:00] 191 [00:48:02] 192 - mwgate-npm-node-6-docker https://integration.wikimedia.org/ci/job/mwgate-npm-node-6-docker/105973/console : SUCCESS in 2m 16s [00:48:05] 193 - quibble-vendor-mysql-hhvm-docker https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/48272/console : SUCCESS in 21m 44s [00:48:08] ...
[00:48:10] oh [00:48:11] like that [00:48:14] i see: [2019-05-06 21:40:21,888 +0000] [01:44:25] PROBLEM - MariaDB Slave Lag: s4 on db2099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 847.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [02:22:29] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:23:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:23:11] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:23:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:23:35] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:23:35] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:23:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:23:57] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [02:24:51] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [02:25:03] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [02:26:01] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [02:26:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [02:26:33] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
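[Returning to the startmsg_regex question from the gerrit discussion above: a sketch of the per-file inputs being considered, with assumed resource titles and path parameters. startmsg_regex tells rsyslog which lines start a new message, so the timestamp-less continuation lines in the sshd_log excerpt pasted at 00:48 get folded into the preceding multiline event; gerrit.json is already one structured event per line and needs no regex. The regex literal is the one quoted at 00:45:51:

    # gerrit.json: one JSON event per line, no multiline joining needed.
    rsyslog::input::file { 'gerrit-json':
        path => '/var/log/gerrit/gerrit.json',
    }
    # sshd_log: only lines opening with a "[2019-05-10 00:09:49,377 +0000]"
    # style timestamp begin a new message; everything else is a continuation.
    rsyslog::input::file { 'gerrit-sshd':
        path           => '/var/log/gerrit/sshd_log',
        startmsg_regex => '^\\\\[[0-9,-\\\\+\\\\ \\\\:]+\\\\]',
    }
]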
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:27:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:27:17] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:27:21] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:27:41] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:27:41] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:27:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:31:43] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [02:32:09] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [02:33:17] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [02:34:15] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [02:34:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [02:44:13] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (10Tgr) Maybe related to T213362#5171787? That was deployed on April 11, and this bug is somehow related to not properly killing thread... [02:53:19] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (10Tgr) OTOH [[https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?panelId=4&fullscreen&orgId=1&var-server=proton1001&va... 
[03:02:35] RECOVERY - MariaDB Slave Lag: s4 on db2099 is OK: OK slave_sql_lag Replication lag: 0.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [04:10:45] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [04:11:13] (03PS2) 10ArielGlenn: dumps: Add clickstream to list of other datasets [puppet] - 10https://gerrit.wikimedia.org/r/509136 (owner: 10Ladsgroup) [04:11:29] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [04:17:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [04:17:01] (03CR) 10ArielGlenn: [C: 03+2] dumps: Add clickstream to list of other datasets [puppet] - 10https://gerrit.wikimedia.org/r/509136 (owner: 10Ladsgroup) [04:17:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [04:24:39] PROBLEM - Check systemd state on ms-be1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:27:25] PROBLEM - MariaDB Slave Lag: s8 on db2100 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 860.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [04:34:01] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:40:55] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:42:31] PROBLEM - swift-object-updater on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [04:42:41] PROBLEM - swift-container-server on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [04:43:41] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[04:43:45] RECOVERY - swift-object-updater on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift [04:43:55] RECOVERY - swift-container-server on ms-be2017 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift [04:52:07] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2112" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509179 [04:52:12] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2112" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509179 [04:53:38] (03CR) 10Marostegui: [C: 03+2] Revert "db-codfw.php: Depool db2112" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509179 (owner: 10Marostegui) [04:54:46] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2112" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509179 (owner: 10Marostegui) [04:56:05] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2112" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509179 (owner: 10Marostegui) [04:56:24] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2112 (duration: 00m 51s) [04:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:29] RECOVERY - Check systemd state on ms-be1015 is OK: OK - running: The system is fully operational [04:58:47] (03PS1) 10Marostegui: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509180 [05:01:01] PROBLEM - Check systemd state on ms-be2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:02:10] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509180 (owner: 10Marostegui) [05:02:27] (03PS1) 10Marostegui: mariadb: Provision db1130 into s5 [puppet] - 10https://gerrit.wikimedia.org/r/509181 (https://phabricator.wikimedia.org/T222682) [05:03:09] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509180 (owner: 10Marostegui) [05:03:23] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509180 (owner: 10Marostegui) [05:03:57] (03CR) 10Marostegui: [C: 03+2] mariadb: Provision db1130 into s5 [puppet] - 10https://gerrit.wikimedia.org/r/509181 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [05:04:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1100 (duration: 00m 50s) [05:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:28] !log Stop MySQL on db1100 [05:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:05] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [05:15:17] (03PS1) 10Marostegui: install_server: Change db2114's MAC [puppet] - 10https://gerrit.wikimedia.org/r/509182 (https://phabricator.wikimedia.org/T222772) [05:16:15] (03CR) 10Marostegui: [C: 03+2] install_server: Change db2114's MAC [puppet] - 10https://gerrit.wikimedia.org/r/509182 (https://phabricator.wikimedia.org/T222772) (owner: 10Marostegui) [05:23:56] (03PS1) 10Marostegui: db2105,db2109: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/509184 (https://phabricator.wikimedia.org/T222772) [05:25:27] (03CR) 
10Marostegui: [C: 03+2] db2105,db2109: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/509184 (https://phabricator.wikimedia.org/T222772) (owner: 10Marostegui) [05:26:29] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [05:28:35] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational [05:32:28] !log restart eventlogging daemons on eventlog1002 - kafka consumer errors in the logs, some lag built over time [05:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:27] (03PS1) 10Marostegui: db.codfw,db-eqiad.php: Pool db2105,db2109 into s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509185 (https://phabricator.wikimedia.org/T222772) [05:35:16] (03PS2) 10Marostegui: db-codfw,db-eqiad.php: Pool db2105,db2109 into s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509185 (https://phabricator.wikimedia.org/T222772) [05:37:45] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ema) [05:37:51] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ema) p:05Triage→03Normal [05:38:09] (03CR) 10Marostegui: [C: 03+2] db-codfw,db-eqiad.php: Pool db2105,db2109 into s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509185 (https://phabricator.wikimedia.org/T222772) (owner: 10Marostegui) [05:38:18] RECOVERY - Check systemd state on ms-be2014 is OK: OK - running: The system is fully operational [05:39:12] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Pool db2105,db2109 into s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509185 (https://phabricator.wikimedia.org/T222772) (owner: 10Marostegui) [05:39:26] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Pool db2105,db2109 into s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509185 (https://phabricator.wikimedia.org/T222772) (owner: 10Marostegui) [05:40:04] !log execute kafka preferred-replica-election on kafka-jumbo1001 as attempt to rebalance traffic (1002 seems handling way more than others since some days) [05:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool db2105 db2109 into s3 T222772 (duration: 00m 52s) [05:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:31] T222772: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 [05:40:36] (03PS4) 10Ema: upload-frontend: unset Cache-Control for upload.w.o [puppet] - 10https://gerrit.wikimedia.org/r/509053 (https://phabricator.wikimedia.org/T222937) [05:41:03] 10Operations, 10ops-codfw, 10DBA: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (10Marostegui) This host has been re-imaged successfully [05:41:18] (03CR) 10Ema: [C: 03+2] upload-frontend: unset Cache-Control for upload.w.o [puppet] - 10https://gerrit.wikimedia.org/r/509053 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [05:41:23] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Pool db2105 db2109 into s3 T222772 (duration: 00m 49s) [05:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:27] (03PS1) 10Marostegui: install_server: Remove db2114 [puppet] - 10https://gerrit.wikimedia.org/r/509186 [05:44:12] 
(03PS2) 10Marostegui: install_server: Remove db2114 [puppet] - 10https://gerrit.wikimedia.org/r/509186 [05:46:27] (03CR) 10Marostegui: [C: 03+2] install_server: Remove db2114 [puppet] - 10https://gerrit.wikimedia.org/r/509186 (owner: 10Marostegui) [05:52:36] 04Critical Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Emergency syslog message [05:53:17] uh? [05:55:28] I am checking librenms for that device, but the alert says: emergency syslog message [05:55:34] But i cannot really see what the message is [05:56:37] Ah, now I see it [05:57:37] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw1-eqsin.mgmt.eqsin.wmnet recovered from Emergency syslog message [05:57:38] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:03:13] (03PS8) 10Marostegui: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725) [06:03:38] (03PS6) 10Ema: swift-proxy: add ensure_max_age middleware [puppet] - 10https://gerrit.wikimedia.org/r/509027 (https://phabricator.wikimedia.org/T222937) [06:07:37] (03CR) 10Ema: "Successfully tested on deployment-prep: https://phabricator.wikimedia.org/P8505" [puppet] - 10https://gerrit.wikimedia.org/r/509027 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [06:08:58] RECOVERY - MariaDB Slave Lag: s8 on db2100 is OK: OK slave_sql_lag Replication lag: 0.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [06:09:09] <_joe_> !log depooling mw1261 for tests [06:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:04] (03CR) 10Ema: [C: 03+2] swift-proxy: add ensure_max_age middleware [puppet] - 10https://gerrit.wikimedia.org/r/509027 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [06:16:21] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509188 [06:17:59] !log ms-fe1005: depool and test ensure_max_age T222937 [06:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:04] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [06:18:21] PROBLEM - puppet last run on lvs5001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:19:50] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509188 (owner: 10Marostegui) [06:20:55] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509188 (owner: 10Marostegui) [06:21:07] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509188 (owner: 10Marostegui) [06:22:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1100 (duration: 00m 50s) [06:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:20] (03PS1) 10ArielGlenn: remove sleep between dumps of adds/changes wikis at last [dumps] - 10https://gerrit.wikimedia.org/r/509189 (https://phabricator.wikimedia.org/T221515) [06:25:07] (03CR) 10ArielGlenn: [C: 03+2] remove sleep between dumps of adds/changes wikis at last [dumps] - 10https://gerrit.wikimedia.org/r/509189 (https://phabricator.wikimedia.org/T221515) (owner: 10ArielGlenn) [06:26:47] !log ariel@deploy1001 Started deploy [dumps/dumps@6f9a5a4]: remove sleep between incr dumps of wikis [06:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:52] !log ariel@deploy1001 Finished deploy [dumps/dumps@6f9a5a4]: remove sleep between incr dumps of wikis (duration: 00m 05s) [06:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:09] !log ms-fe1005: pool with ensure_max_age T222937 [06:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:14] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [06:29:37] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/sudoers] [06:31:49] PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/rsyslog.lookup.d/lookup_table_output.json] [06:35:55] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1100 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509193 [06:38:17] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1100 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509193 (owner: 10Marostegui) [06:39:28] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1100 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509193 (owner: 10Marostegui) [06:39:31] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1100 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509193 (owner: 10Marostegui) [06:40:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1100 into API (duration: 00m 50s) [06:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:51] (03CR) 10Muehlenhoff: [C: 03+2] Bump meta package for new ABI in 4.9.168 for jessie Drop meta package fr Linux 4.14, we didn't need it in the end [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/508815 (owner: 10Muehlenhoff) [06:49:47] RECOVERY - puppet last run on lvs5001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:51:17] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1100 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509194 [06:55:53] !log swift-fe: rolling restart to enable ensure_max_age T222937 [06:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:58] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [06:58:31] RECOVERY - puppet last run on db1092 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:01:27] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:10:07] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1100 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509194 (owner: 10Marostegui) [07:11:24] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1100 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509194 (owner: 10Marostegui) [07:11:29] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1100 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509194 (owner: 10Marostegui) [07:12:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1100 into API (duration: 00m 50s) [07:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:03] !log uploaded linux-meta 1.21 for jessie-wikimedia (pointing to the new -9 ABI introduced with the 4.9.168 kernel) [07:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:24] PROBLEM - MariaDB Slave Lag: s1 on db1139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 818.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [07:22:42] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db1139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1144.07 seconds Jcrespo ongoing backups https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [07:33:59] me raises an eyebrow [07:34:12] thanks morning typo fingers [08:13:08] (03PS14) 10ArielGlenn: dump url shorteners for wiki projects [puppet] - 10https://gerrit.wikimedia.org/r/278400 
(https://phabricator.wikimedia.org/T116986) [08:14:08] (03PS15) 10ArielGlenn: dump url shorteners for wiki projects [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) [08:15:38] (03CR) 10jerkins-bot: [V: 04-1] dump url shorteners for wiki projects [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [08:16:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] Remove unused/untrusted IP ranges from trusted lists [puppet] - 10https://gerrit.wikimedia.org/r/509140 (https://phabricator.wikimedia.org/T222392) (owner: 10Ayounsi) [08:17:17] (03PS1) 10Elukey: profile::eventlogging::analytics::server: add max_poll_records for client-side [puppet] - 10https://gerrit.wikimedia.org/r/509341 [08:19:35] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/16450/eventlog1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/509341 (owner: 10Elukey) [08:24:47] PROBLEM - puppet last run on dns5001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [08:33:19] (03PS16) 10ArielGlenn: dump url shorteners for wiki projects [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) [08:35:21] RECOVERY - MariaDB Slave Lag: s1 on db1139 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:38:14] (03PS1) 10Jcrespo: mariadb-backups: Force snapshots in certain order [puppet] - 10https://gerrit.wikimedia.org/r/509343 (https://phabricator.wikimedia.org/T206203) [08:41:03] (03PS1) 10Elukey: profile::eventlogging::analytics::server: raise timeout limits [puppet] - 10https://gerrit.wikimedia.org/r/509344 [08:42:09] (03CR) 10Elukey: [C: 03+2] profile::eventlogging::analytics::server: raise timeout limits [puppet] - 10https://gerrit.wikimedia.org/r/509344 (owner: 10Elukey) [08:44:29] (03CR) 10Marostegui: [C: 03+1] "let's deploy and see if it needs tuning/testing" [puppet] - 10https://gerrit.wikimedia.org/r/509343 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [08:45:49] (03PS17) 10ArielGlenn: dump url shorteners for wiki projects [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) [08:50:46] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Force snapshots in certain order [puppet] - 10https://gerrit.wikimedia.org/r/509343 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [08:51:07] (03PS2) 10Jcrespo: mariadb-backups: Force snapshots in certain order [puppet] - 10https://gerrit.wikimedia.org/r/509343 (https://phabricator.wikimedia.org/T206203) [08:51:35] RECOVERY - puppet last run on dns5001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:51:45] (03PS1) 10Elukey: profile::eventlogging::analytics::server: remove timeouts settings [puppet] - 10https://gerrit.wikimedia.org/r/509346 [08:52:44] (03CR) 10Elukey: [C: 03+2] profile::eventlogging::analytics::server: remove timeouts settings [puppet] - 10https://gerrit.wikimedia.org/r/509346 (owner: 10Elukey) [08:57:44] (03PS3) 10Jcrespo: mariadb-backups: Force snapshots in certain order [puppet] - 10https://gerrit.wikimedia.org/r/509343 (https://phabricator.wikimedia.org/T206203) [09:14:18] (03CR) 10Alexandros Kosiaris: [C: 03+1] service::docker - add $image_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/509141 (https://phabricator.wikimedia.org/T218346) (owner: 
10Ottomata) [09:16:03] 10Operations, 10ops-codfw, 10media-storage, 10observability: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (10fgiunchedi) a:03Papaul We're seeing error on this disk on slot 3 on this host, could we get it replaced under warranty? Thanks! The drive should be blinking... [09:17:15] !log disabling replication lag alerts for backup source hosts on s1, s4, s8 T206203 [09:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:20] T206203: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 [09:27:59] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [09:40:31] PROBLEM - Host elastic2038 is DOWN: PING CRITICAL - Packet loss = 100% [09:43:36] 10Operations, 10serviceops: Separate Wikitech cronjobs from production - https://phabricator.wikimedia.org/T222900 (10jijiki) @Dzahn THANK YOU! 😍 [09:45:08] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) We have upgraded php7 on beta, so now it looks like async jobs are running. We will leave it as is until n... [09:50:06] 10Operations, 10Performance-Team (Radar): PHP fatal error handler not working on mwdebug servers - https://phabricator.wikimedia.org/T217846 (10jijiki) [09:52:27] 10Operations, 10Performance-Team (Radar): PHP fatal error handler not working on mwdebug servers - https://phabricator.wikimedia.org/T217846 (10jijiki) [09:52:59] (03PS18) 10ArielGlenn: dump url shorteners for wiki projects [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) [09:56:41] (03PS19) 10ArielGlenn: dump url shorteners for wiki projects [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) [10:00:07] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:00:35] PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [10:10:12] (03PS2) 10Alexandros Kosiaris: kask: Add incubator/cassandra subchart [deployment-charts] - 10https://gerrit.wikimedia.org/r/509102 (https://phabricator.wikimedia.org/T220401) [10:11:36] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] kask: Add incubator/cassandra subchart [deployment-charts] - 10https://gerrit.wikimedia.org/r/509102 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [10:13:39] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10elukey) The upgrade of python-kafka to 1.4.6 on eventlog1002 coincides very well with T222941 :( [10:17:01] (03PS1) 10Ema: Add vagrant directory to .gitignore [puppet] - 10https://gerrit.wikimedia.org/r/509348 [10:19:38] (03PS2) 10Ema: Add vagrant directory to .gitignore [puppet] - 10https://gerrit.wikimedia.org/r/509348 [10:20:29] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) @Clarakosi @Eevans. 
I 've updated the chart to also conditionally install a minimal cassandra for use in m... [10:22:47] (03PS3) 10Ema: Add vagrant directory to .gitignore [puppet] - 10https://gerrit.wikimedia.org/r/509348 [10:27:29] RECOVERY - puppet last run on authdns2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:30:03] !log installing symfony security updates [10:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:25] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, I'm wondering if we have a way to test this before production ? say a cassandra 2 cluster running in wmcs ?" [puppet] - 10https://gerrit.wikimedia.org/r/508809 (https://phabricator.wikimedia.org/T193017) (owner: 10Mathew.onipe) [10:34:39] 10Operations, 10Performance-Team (Radar): PHP fatal error handler not working on mwdebug servers - https://phabricator.wikimedia.org/T217846 (10jijiki) After some debugging on hassium, I found some interesting things: Test 1: Enabled: Beta feature, X-Wikimedia-Debug, PHP7.x https://phabricator.wikimedia.org/... [10:36:18] 10Operations, 10serviceops: Separate Wikitech cronjobs from production - https://phabricator.wikimedia.org/T222900 (10jbond) p:05Triage→03Normal [10:38:07] (03CR) 10Filippo Giunchedi: [C: 03+1] Add vagrant directory to .gitignore [puppet] - 10https://gerrit.wikimedia.org/r/509348 (owner: 10Ema) [10:40:22] (03PS1) 10Jbond: Access request: add cparle to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/509355 (https://phabricator.wikimedia.org/T222864) [10:46:31] 10Operations, 10SRE-Access-Requests, 10Security-Team: Requesting access to deployment and analytics-privatedata-users for jfishback - https://phabricator.wikimedia.org/T222910 (10jbond) @greg can you please approve jfishback addition to the deployment group @Nuria can you please approve jfishback addition t... [10:48:29] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Joe) [10:48:37] 10Operations, 10Traffic, 10serviceops, 10PHP 7.2 support, and 2 others: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705 (10Joe) 05Open→03Resolved [10:48:55] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:50:06] * akosiaris looking [10:50:17] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:50:24] up already? [10:50:32] <_joe_> I see ospf flapping in various datacenters [10:51:01] <_joe_> cr2-eqiad, cr2-esams,cr1-eqsin [10:51:04] all about eqsin? [10:51:08] <_joe_> and now recovered [10:51:15] <_joe_> what did you do [10:51:33] breathed [10:51:44] heavily [10:52:02] <_joe_> less garlic at breakfast, akosiaris [10:52:18] i see a gre tunnel flapping [10:52:19] lol [10:53:11] May 10 10:46:02 re0.cr2-eqiad rpd[2126]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::2a0:a502:75:ea48 (realm ipv6-unicast gr-4/3/0.1 area 0.0.0.0) state changed from Full to Init due to 1WayRcvd (event reason: neighbor is in one-way mode) [10:53:13] and so on [10:55:57] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10elukey) >>! 
In T221848#5137318, @MoritzMuehlenhoff wrote: > We can probably simply backport https://github.com/dpkp/kafka-python/pull/1628/commits/f12d4... [11:03:14] (03PS1) 10Aaron Schulz: Set "secret" field in $wgLBFactoryConf for ChronologyProtector HMACs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509357 [11:03:31] PROBLEM - Check Varnish expiry mailbox lag on cp3038 is CRITICAL: CRITICAL: expiry mailbox lag is 2053697 https://wikitech.wikimedia.org/wiki/Varnish [11:04:23] !log restart refinery-eventlogging-saltrotate on an-coord1001 [11:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:53] jbond42: did it fail? [11:07:32] elukey: yes it did, running it manually as the hdfs user seemed to run correctly and exited with 0, was just about to ping you :) [11:27:05] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [11:45:21] (03CR) 10Ema: [C: 03+2] Add vagrant directory to .gitignore [puppet] - 10https://gerrit.wikimedia.org/r/509348 (owner: 10Ema) [11:52:46] !log (un)load edac kernel modules on elastic1029 to test resetting counters [11:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:35] (03PS1) 10Ema: cache: reimage cp3038 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/509362 (https://phabricator.wikimedia.org/T222937) [11:59:52] (03CR) 10Ema: "pcc looks good https://puppet-compiler.wmflabs.org/compiler1002/16454/" [puppet] - 10https://gerrit.wikimedia.org/r/509362 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [12:17:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/509362 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [12:19:22] (03PS1) 10Jbond: monitoring: add notes url for memory errors [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) [12:19:42] !log depool cp3038 and reimage as upload_ats T222937 [12:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:46] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [12:21:26] (03CR) 10Ema: [C: 03+2] cache: reimage cp3038 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/509362 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [12:23:09] 10Operations, 10DC-Ops, 10Traffic, 10observability, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177 (10jbond) >>! In T183177#4191082, @fgiunchedi wrote: > The correctable errors check has been deployed and it is yielding some results already. Myself and @herron... [12:27:17] RECOVERY - Long running screen/tmux on notebook1004 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [12:28:19] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3038.esams.wmnet'] ` The log can be found in `...
[12:58:21] 10Operations, 10ops-eqiad, 10cloud-services-team: cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (10jbond) [13:02:55] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10fgiunchedi) >>! In T196066#5170415, @Ottomata wrote: > I think there are a few more branches: > > - prod... [13:04:51] (03CR) 10Paladox: "@godog wondering if you can review please? :)" [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [13:04:58] (03PS5) 10Paladox: Add prometheus server for gerrit javamelody monitoring [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) [13:05:07] (03PS17) 10Paladox: Gerrit: Set plugin.javamelody.prometheusBearerToken [puppet] - 10https://gerrit.wikimedia.org/r/508621 (https://phabricator.wikimedia.org/T184086) [13:05:09] (03CR) 10Filippo Giunchedi: "> Patch Set 7:" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [13:06:06] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10fgiunchedi) Thank you @RobH ! Yes I'll take it from here and hand over as needed when done [13:10:03] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3038.esams.wmnet'] ` and were **ALL** successful. [13:10:22] (03CR) 10CDanis: "adding volans to reviewers; and see also comments on I355dec2" [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [13:14:02] (03PS1) 10Elukey: profile::analytics::refinery::job::data_purge: fix saltrotate command [puppet] - 10https://gerrit.wikimedia.org/r/509387 (https://phabricator.wikimedia.org/T212014) [13:15:21] (03PS2) 10Elukey: profile::analytics::refinery::job::data_purge: fix saltrotate command [puppet] - 10https://gerrit.wikimedia.org/r/509387 (https://phabricator.wikimedia.org/T212014) [13:16:32] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10akosiaris) @WMDE-leszek. Yes I did. Using https://locust.io/, wrote P8511 and benchmarked the service locally on my minikube... [13:17:36] PROBLEM - IPMI Sensor Status on cp3038 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [13:18:16] (03PS7) 10Alexandros Kosiaris: First draft of a wikibase-termbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402) [13:22:42] RECOVERY - Long running screen/tmux on lithium is OK: OK: No SCREEN or tmux processes detected. 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [13:22:53] (03PS3) 10Elukey: profile::analytics::refinery::job::data_purge: fix saltrotate command [puppet] - 10https://gerrit.wikimedia.org/r/509387 (https://phabricator.wikimedia.org/T212014) [13:24:28] (03CR) 10Ottomata: [C: 03+1] profile::analytics::refinery::job::data_purge: fix saltrotate command [puppet] - 10https://gerrit.wikimedia.org/r/509387 (https://phabricator.wikimedia.org/T212014) (owner: 10Elukey) [13:25:45] (03CR) 10Ottomata: [C: 03+1] profile::eventlogging::analytics::server: raise timeout limits [puppet] - 10https://gerrit.wikimedia.org/r/509344 (owner: 10Elukey) [13:27:49] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10Ottomata) That'd be fine! [13:30:13] (03CR) 10Mforns: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/509387 (https://phabricator.wikimedia.org/T212014) (owner: 10Elukey) [13:30:21] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3038 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ema Known since months https://phabricator.wikimedia.org/T203272 [13:30:33] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::data_purge: fix saltrotate command [puppet] - 10https://gerrit.wikimedia.org/r/509387 (https://phabricator.wikimedia.org/T212014) (owner: 10Elukey) [13:30:41] !log pool cp3038 w/ ATS backend T222937 [13:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:51] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [13:32:56] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) > I'm not sure I see the value in breaking up the broker name into broker_hostname, broker_id,... 
[13:37:26] (03PS1) 10Ema: cache_upload esams: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/509394 (https://phabricator.wikimedia.org/T222937) [13:40:17] (03CR) 10Ema: "pcc is fine https://puppet-compiler.wmflabs.org/compiler1001/16457/" [puppet] - 10https://gerrit.wikimedia.org/r/509394 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [13:40:50] (03CR) 10Ema: [C: 03+2] cache_upload esams: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/509394 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [13:43:16] (03PS1) 10Elukey: profile::analytics::refinery::job: remove trailing typo from erb [puppet] - 10https://gerrit.wikimedia.org/r/509397 [13:44:05] (03PS2) 10Elukey: profile::analytics::refinery::job: remove trailing typo from erb [puppet] - 10https://gerrit.wikimedia.org/r/509397 [13:45:05] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job: remove trailing typo from erb [puppet] - 10https://gerrit.wikimedia.org/r/509397 (owner: 10Elukey) [13:54:37] (03PS1) 10Ottomata: (Attempt to) Fix deadlock between consumer and heartbeat (#1628) [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/509400 [13:54:59] (03CR) 10Ottomata: [V: 03+2 C: 03+2] (Attempt to) Fix deadlock between consumer and heartbeat (#1628) [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/509400 (owner: 10Ottomata) [13:56:52] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10akosiaris) @Tarrow , @WMDE-leszek I 've noticed 3 things while working on the above * The service seems to be configurable to... [13:58:31] (03PS1) 10Elukey: profile::eventlogging::analytics::server: remove unneeded max_poll_records [puppet] - 10https://gerrit.wikimedia.org/r/509403 (https://phabricator.wikimedia.org/T222941) [14:03:03] (03PS1) 10Ema: upload_ats: include ats-be in backend_services [puppet] - 10https://gerrit.wikimedia.org/r/509405 (https://phabricator.wikimedia.org/T222937) [14:05:58] (03CR) 10Mathew.onipe: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/508809 (https://phabricator.wikimedia.org/T193017) (owner: 10Mathew.onipe) [14:08:57] (03PS1) 10Filippo Giunchedi: cassandra: check for flag file before service startup [puppet] - 10https://gerrit.wikimedia.org/r/509409 (https://phabricator.wikimedia.org/T214166) [14:09:43] 10Operations, 10cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (10Andrew) a:03Andrew [14:10:40] 10Operations, 10ops-eqiad, 10cloud-services-team: cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (10jbond) p:05Triage→03Normal [14:11:14] (03CR) 10Ema: [C: 03+2] "pcc looks good https://puppet-compiler.wmflabs.org/compiler1001/16458/" [puppet] - 10https://gerrit.wikimedia.org/r/509405 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [14:12:02] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/16459/restbase1010.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/509409 (https://phabricator.wikimedia.org/T214166) (owner: 10Filippo Giunchedi) [14:17:07] (03PS2) 10Elukey: profile::eventlogging::analytics::server: remove unneeded max_poll_records [puppet] - 10https://gerrit.wikimedia.org/r/509403 (https://phabricator.wikimedia.org/T222941) [14:18:34] (03CR) 10Elukey: [C: 03+2] profile::eventlogging::analytics::server: remove unneeded 
max_poll_records [puppet] - 10https://gerrit.wikimedia.org/r/509403 (https://phabricator.wikimedia.org/T222941) (owner: 10Elukey) [14:20:28] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:21:45] (03PS8) 10Alexandros Kosiaris: First draft of a wikibase-termbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402) [14:22:07] (03CR) 10Alexandros Kosiaris: First draft of a wikibase-termbox chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402) (owner: 10Alexandros Kosiaris) [14:27:12] (03PS2) 10Ottomata: service::docker - add $image_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/509141 (https://phabricator.wikimedia.org/T218346) [14:30:05] (03CR) 10Ottomata: [C: 03+2] service::docker - add $image_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/509141 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [14:30:06] !log akosiaris@deploy1001 scap-helm eventgate-analytics upgrade -f eventgate-analytics-codfw-values.yaml production stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [14:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:10] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster codfw completed [14:30:10] !log akosiaris@deploy1001 scap-helm eventgate-analytics finished [14:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:38] !log akosiaris@deploy1001 scap-helm eventgate-analytics upgrade -f eventgate-analytics-eqiad-values.yaml production stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [14:30:40] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [14:30:40] !log akosiaris@deploy1001 scap-helm eventgate-analytics finished [14:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:55] !log akosiaris@deploy1001 scap-helm eventgate-analytics upgrade -f lala.yaml staging stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [14:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:57] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster staging completed [14:32:57] !log akosiaris@deploy1001 scap-helm eventgate-analytics finished [14:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:12] 10Operations, 10cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (10Andrew) [14:39:26] 10Operations, 10cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (10Andrew) [14:41:44] (03PS2) 10Krinkle: Set wgLocalisationCacheConf['storeClass'] explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508726 (https://phabricator.wikimedia.org/T99740) [14:49:38] (03PS1) 10Filippo Giunchedi: cassandra: add restbase10[19-27] [puppet] - 10https://gerrit.wikimedia.org/r/509422 
(https://phabricator.wikimedia.org/T219404) [14:49:41] (03PS1) 10Filippo Giunchedi: conftool-data: add restbase10[19-27] [puppet] - 10https://gerrit.wikimedia.org/r/509423 (https://phabricator.wikimedia.org/T219404) [14:49:43] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Wikimedia-Incident, 10Zuul: Upload zuul_2.5.1-wmf8 to apt.wikimedia.org - https://phabricator.wikimedia.org/T222689 (10hashar) 05Resolved→03Open Eventually the backport patch for T140297 was wrong. It does not apply from Zuu... [14:49:45] (03CR) 10Krinkle: [C: 03+2] Set wgLocalisationCacheConf['storeClass'] explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508726 (https://phabricator.wikimedia.org/T99740) (owner: 10Krinkle) [14:49:48] * Krinkle staging on mwdebug1002 [14:50:56] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Wikimedia-Incident, 10Zuul: Upload zuul_2.5.1-wmf9 to apt.wikimedia.org - https://phabricator.wikimedia.org/T222689 (10hashar) [14:51:33] (03Merged) 10jenkins-bot: Set wgLocalisationCacheConf['storeClass'] explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508726 (https://phabricator.wikimedia.org/T99740) (owner: 10Krinkle) [14:51:46] (03CR) 10jenkins-bot: Set wgLocalisationCacheConf['storeClass'] explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508726 (https://phabricator.wikimedia.org/T99740) (owner: 10Krinkle) [14:53:04] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Wikimedia-Incident, 10Zuul: Upload zuul_2.5.1-wmf9 to apt.wikimedia.org - https://phabricator.wikimedia.org/T222689 (10hashar) [14:54:23] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: T99740 / d9dbecad9c7b (duration: 00m 51s) [14:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:31] T99740: Use static php array files for l10n cache instead of CDB - https://phabricator.wikimedia.org/T99740 [14:55:26] (03PS1) 10Jbond: flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509426 (https://phabricator.wikimedia.org/T144169) [14:56:05] !log uploaded zuul_2.5.10-wmf9 to jessie-wikimedia [14:56:06] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational [14:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:57:06] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:57:24] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Wikimedia-Incident, 10Zuul: Upload zuul_2.5.1-wmf9 to apt.wikimedia.org - https://phabricator.wikimedia.org/T222689 (10jbond) 05Open→03Resolved a:03jbond latest package has been uploaded re open if further problems [14:57:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:57:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:57:52] PROBLEM - PHP7 rendering on mw1239 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 861 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:57:58] PROBLEM - PHP7 rendering on mw1268 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 494 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:58:02] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:58:37] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/16460/restbase1010.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/509422 (https://phabricator.wikimedia.org/T219404) (owner: 10Filippo Giunchedi) [14:59:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:59:46] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:59:57] mhh looks like an already-recovered spike [14:59:58] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [15:00:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:00:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:00:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:01:12] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:01:32] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:01:34] godog: I'm going to guess that Krinkle's deploy inadvertently caused some opcache corruption on those two appservers with the PHP7 rendering warnings [15:01:34] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:02:08] cdanis: indeed, snafu [15:02:14] fixing now by manually pulling on those [15:02:46] good thinking, thanks cdanis Krinkle [15:02:47] cdanis: limited to php7 for what it's worth [15:02:57] 99% of the requests were from monitoring BlankPage?force_php7=1 [15:03:01] <_joe_> yes [15:03:05] I didn't know that was hit so heavily [15:03:06] <_joe_> which is what should happen [15:03:07] where is that from? [15:03:10] <_joe_> pybal [15:03:20] <_joe_> once it's down it checks every few seconds [15:03:20] RECOVERY - PHP7 rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 80489 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:03:20] (03CR) 10Volans: monitoring: add notes url for memory errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [15:03:33] <_joe_> did someone fix that? [15:03:38] <_joe_> mw1239 I mean [15:03:51] <_joe_> I would prefer to see if they autorecover or not [15:03:52] !log ran 'scap pull' on mw1239.eqiad.wmnet to fix opcache corruption [15:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:02] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:04:04] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [15:04:16] <_joe_> Krinkle: so the fatals are all from monitoring? [15:04:20] May 10 14:54:18 mw1268 php7.2-fpm[16873]: PHP Fatal error: Uncaught Error: Call to undefined method MediaWiki\Revision\RevisionStore::newRevisionFromRow() in /srv/mediawiki/php-1.34.0-wmf.4/includes/Revision/RevisionStore.php:2166 [15:04:27] PROBLEM - Check Varnish expiry mailbox lag on cp3034 is CRITICAL: CRITICAL: expiry mailbox lag is 2040696 https://wikitech.wikimedia.org/wiki/Varnish [15:04:46] _joe_: well, and some small amount of real user traffic naturally. [15:04:57] !log cdanis@mw1268.eqiad.wmnet ~ % sudo php7adm /opcache-free [15:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:09] !log fix opcache krinkle@mw1268:~$ scap pull [15:05:10] https://logstash.wikimedia.org/goto/60777f6d1c64f147763bfb04ba18caac FWIW [15:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:19] for the 5xx from varnish that is [15:05:27] !log cdanis@mw1239.eqiad.wmnet ~ % sudo php7adm /opcache-free [15:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:40] _joe_: I can't see what is depooled or not, but as long as they scap the main prod metrics and make my dashboards useless, I'd rather not do nothing and hope it's not affecting users.
[15:05:49] as they affect* (not scap) [15:06:02] PROBLEM - swift-account-reaper on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:06:03] PROBLEM - swift-container-auditor on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:06:08] RECOVERY - PHP7 rendering on mw1268 is OK: HTTP OK: HTTP/1.1 200 OK - 80489 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:06:16] PROBLEM - Check systemd state on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer [15:06:16] that was a scary deployment. [15:06:20] <_joe_> Krinkle: that's fair, we need a way to detect this issue and autorecover from it probably [15:06:22] May 10 14:54:21 mw1239 php7.2-fpm[11564]: PHP Fatal error: Cannot declare class Wikibase\Lib\Store\EntityRevision, because the name is already in use in /srv/mediawiki/php-1.34.0-wmf.4/extensions/Wikibase/lib/includes/Store/EntityRevision.php on line 19 [15:06:22] PROBLEM - dhclient process on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer [15:06:26] <_joe_> but it's not new sadly [15:06:34] PROBLEM - swift-account-server on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:06:36] <_joe_> opcache gets corrupted [15:06:36] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [15:06:40] cdanis: yeah, all snafu and known in phab. 
[15:06:44] PROBLEM - very high load average likely xfs on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:06:48] PROBLEM - swift-object-replicator on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:06:50] PROBLEM - DPKG on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer [15:06:55] I'll check on ms-be2014 [15:06:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:06:59] <_joe_> I'm even willing to change the system not to explicitly clean opcache [15:07:00] PROBLEM - Check size of conntrack table on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:07:02] PROBLEM - swift-object-server on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:07:10] PROBLEM - Disk space on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [15:07:12] Unfortunately our regular error rates are sufficiently spammy that you can't use the normal dashboards to find out whether stuff is broken. [15:07:18] PROBLEM - swift-object-auditor on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:07:18] PROBLEM - swift-account-auditor on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:07:22] PROBLEM - MD RAID on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:07:25] You have to look at it every day and know which ones are normal, and which ones are not to be able to interpret it.
[15:07:33] <_joe_> yes :/ [15:07:40] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:07:42] PROBLEM - configured eth on ms-be2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.32: Connection reset by peer [15:07:58] RECOVERY - very high load average likely xfs on ms-be2014 is OK: OK - load average: 16.63, 14.06, 11.18 https://wikitech.wikimedia.org/wiki/Swift [15:08:04] RECOVERY - swift-object-replicator on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [15:08:04] RECOVERY - DPKG on ms-be2014 is OK: All packages OK [15:08:15] silenced ms-be2014, we'll still get recovery spam tho [15:08:16] RECOVERY - Check size of conntrack table on ms-be2014 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:08:18] RECOVERY - swift-object-server on ms-be2014 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server https://wikitech.wikimedia.org/wiki/Swift [15:08:22] RECOVERY - Disk space on ms-be2014 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [15:08:27] godog: sigh, I think my disk scheduler attempt did nothing [15:08:32] RECOVERY - swift-account-auditor on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [15:08:32] RECOVERY - swift-object-auditor on ms-be2014 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [15:08:38] RECOVERY - MD RAID on ms-be2014 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:08:42] RECOVERY - swift-account-reaper on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [15:08:42] RECOVERY - swift-container-auditor on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [15:08:45] _joe_: the rate at which pybal did that seemed a bit scary. I don't know if that's the same rate pybal normally uses and that I've just not noticed before (given it's rate). But might be something we want to back off a little. [15:08:48] what's going on?
[15:08:54] (have not actually looked at the data yet, but ms-be2014 at this point should not be doing anything except deleting partitions that have been replicated) [15:08:56] RECOVERY - Check systemd state on ms-be2014 is OK: OK - running: The system is fully operational [15:08:58] RECOVERY - configured eth on ms-be2014 is OK: OK - interfaces up [15:09:02] RECOVERY - dhclient process on ms-be2014 is OK: PROCS OK: 0 processes with command name dhclient [15:09:03] given it's rare** [15:09:12] RECOVERY - swift-account-server on ms-be2014 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift [15:09:14] ema: monitoring noise from ms-be2014; php7 opcache corruption (fixed) on two appservers; griping :) [15:11:09] cdanis: yeah :| I wonder what happened [15:12:14] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:13:00] godog: it took me quite a long time (almost 10 s) for my ssh session to open on ms-be2014 right now... looking at the host dashboard seems like lots of memory pressure [15:13:57] oh, hm, that's not 'real' memory pressure, just lots of iobuffers being deallocated maybe? [15:16:10] PROBLEM - Check Varnish expiry mailbox lag on cp3036 is CRITICAL: CRITICAL: expiry mailbox lag is 2077550 https://wikitech.wikimedia.org/wiki/Varnish [15:16:33] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 4 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10fgiunchedi) [15:18:40] cdanis: maybe! as I ssh'd in load was already decreasing [15:21:52] (03PS1) 10CRusnov: Also exclude `FAILED` state from PuppetDB reports [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/509431 [15:25:04] jouncebot: now [15:25:04] No deployments scheduled for the next 67 hour(s) and 4 minute(s) [15:25:10] (03CR) 10Eevans: [C: 03+1] "I guess ideally you'd negate the condition for flag file that would prevent startup, but I can't think of how you'd do so while avoiding t" [puppet] - 10https://gerrit.wikimedia.org/r/509409 (https://phabricator.wikimedia.org/T214166) (owner: 10Filippo Giunchedi) [15:25:36] <_joe_> !log wiped opcache clean on all api, appservers [15:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:25] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Wikimedia-production-error: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable) - https://phabricator.wikimedia.org/T221347 (10Krinkle) During a deployment today (sync-file wmf-config/CommonS... [15:28:28] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Wikimedia-production-error: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable) - https://phabricator.wikimedia.org/T221347 (10Krinkle) [15:28:31] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [15:32:08] (03CR) 10Ottomata: "FYI, This patch was causing git-sync-upstream's rebase to fail on deployemnt-puppetmaster03. I removed it from the rebase list." 
[puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [15:33:00] (03PS1) 10Michael Große: Add EntitySchema to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509437 (https://phabricator.wikimedia.org/T221650) [15:33:17] (03PS1) 10CDanis: php-admin: always full-reset opcache, ignore filename list [puppet] - 10https://gerrit.wikimedia.org/r/509438 (https://phabricator.wikimedia.org/T221347) [15:34:07] 10Operations, 10serviceops: Separate Wikitech cronjobs from production - https://phabricator.wikimedia.org/T222900 (10Krinkle) 05Open→03Resolved Appears to be resolved. Re-open if I misunderstood :) [15:34:12] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Backlog (Watching / External), and 3 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Krinkle) [15:34:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] php-admin: always full-reset opcache, ignore filename list [puppet] - 10https://gerrit.wikimedia.org/r/509438 (https://phabricator.wikimedia.org/T221347) (owner: 10CDanis) [15:34:27] (03PS1) 10ArielGlenn: dump special files are per run [dumps] - 10https://gerrit.wikimedia.org/r/509439 (https://phabricator.wikimedia.org/T222948) [15:37:00] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Joe) [15:37:07] 10Operations, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10Joe) 05Open→03Resolved [15:38:57] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [15:39:00] 10Operations, 10serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10Reedy) [15:39:20] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Reedy) [15:40:50] (03PS1) 10Ottomata: Keep using $title for service config dir name, not $image_name [puppet] - 10https://gerrit.wikimedia.org/r/509442 (https://phabricator.wikimedia.org/T218346) [15:41:35] (03CR) 10Ottomata: [C: 03+2] Keep using $title for service config dir name, not $image_name [puppet] - 10https://gerrit.wikimedia.org/r/509442 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [15:46:14] (03PS1) 10Ema: ATS: do not cache server errors [puppet] - 10https://gerrit.wikimedia.org/r/509443 (https://phabricator.wikimedia.org/T222937) [15:46:18] (03PS1) 10Jbond: flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509444 (https://phabricator.wikimedia.org/T144169) [15:47:03] (03CR) 10jerkins-bot: [V: 04-1] ATS: do not cache server errors [puppet] - 10https://gerrit.wikimedia.org/r/509443 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [15:47:11] (03CR) 10jerkins-bot: [V: 04-1] flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509444 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [15:49:25] (03CR) 10CRusnov: [C: 03+2] Also exclude `FAILED` state from PuppetDB reports [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/509431 (owner: 
10CRusnov) [15:49:35] (03Abandoned) 10Jforrester: Add WikibaseSchema to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505813 (https://phabricator.wikimedia.org/T221650) (owner: 10Lucas Werkmeister (WMDE)) [15:50:51] (03CR) 10Jforrester: [C: 04-2] "Looks good. This has to wait until both the current and immediately previous branches in production (i.e., now that I768bd35a33 has landed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509437 (https://phabricator.wikimedia.org/T221650) (owner: 10Michael Große) [15:51:12] (03PS2) 10Ema: ATS: do not cache server errors [puppet] - 10https://gerrit.wikimedia.org/r/509443 (https://phabricator.wikimedia.org/T222937) [15:53:03] (03PS1) 10CRusnov: profile::netbox: stop using icinga as remote cron [puppet] - 10https://gerrit.wikimedia.org/r/509445 [15:53:41] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: stop using icinga as remote cron [puppet] - 10https://gerrit.wikimedia.org/r/509445 (owner: 10CRusnov) [15:54:00] (03PS2) 10Bstorm: wiki replicas: apply black formatting to maintain-views.py [puppet] - 10https://gerrit.wikimedia.org/r/509147 [15:54:26] (03PS2) 10Bstorm: wiki replicas: Fix misconfiguration in the views [puppet] - 10https://gerrit.wikimedia.org/r/509148 (https://phabricator.wikimedia.org/T212972) [15:55:35] (03PS2) 10Jbond: flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509444 (https://phabricator.wikimedia.org/T144169) [15:55:51] (03CR) 10Bstorm: [C: 03+2] wiki replicas: apply black formatting to maintain-views.py [puppet] - 10https://gerrit.wikimedia.org/r/509147 (owner: 10Bstorm) [15:56:59] (03CR) 10jerkins-bot: [V: 04-1] flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509444 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [15:57:28] (03CR) 10Jbond: "The CI message from flake8 appears to be because it is using python2 flake8 and not python3 flake8" [puppet] - 10https://gerrit.wikimedia.org/r/509444 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [15:59:29] (03PS1) 10Ottomata: beta - Use eventgate services on new deployment-eventgate-1 instance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509449 (https://phabricator.wikimedia.org/T218346) [16:00:17] (03CR) 10Michael Große: "> Patch Set 1: Code-Review-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509437 (https://phabricator.wikimedia.org/T221650) (owner: 10Michael Große) [16:00:23] (03CR) 10jerkins-bot: [V: 04-1] beta - Use eventgate services on new deployment-eventgate-1 instance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509449 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [16:03:26] (03PS2) 10Ottomata: beta - Use eventgate services on new deployment-eventgate-1 instance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509449 (https://phabricator.wikimedia.org/T218346) [16:06:02] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [16:08:47] \o/ [16:09:50] (03PS1) 10Ema: ATS: disable Negative Response Caching [puppet] - 10https://gerrit.wikimedia.org/r/509456 (https://phabricator.wikimedia.org/T222937) [16:09:53] (03CR) 10ArielGlenn: [C: 03+2] dump special files are per run [dumps] - 10https://gerrit.wikimedia.org/r/509439 (https://phabricator.wikimedia.org/T222948) (owner: 10ArielGlenn) [16:11:17] !log ariel@deploy1001 Started deploy [dumps/dumps@70e8498]: look for dumpstatus json file per wiki run 
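(Context for Ema's two ATS patches above: "Negative Response Caching" is Apache Traffic Server's option to cache error responses such as 5xxs, which is how a transient origin error can get pinned in cache. A hedged sketch of what disabling it could look like on the Puppet side is below; the setting name is a standard ATS records.config knob, but the resource shape, file path, and service name here are assumptions for illustration, not the actual patch.)

  # Hypothetical sketch: keep ATS from caching negative (error) responses.
  # Managing records.config via stdlib's file_line is assumed; the real
  # trafficserver module may render the whole file from a template instead.
  file_line { 'ats-disable-negative-caching':
    path   => '/etc/trafficserver/records.config',
    line   => 'CONFIG proxy.config.http.negative_caching_enabled INT 0',
    match  => '^CONFIG proxy\.config\.http\.negative_caching_enabled\s',
    notify => Service['trafficserver'],  # assumed service resource name
  }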
[16:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:23] !log ariel@deploy1001 Finished deploy [dumps/dumps@70e8498]: look for dumpstatus json file per wiki run (duration: 00m 05s) [16:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:51] (03PS2) 10Ema: ATS: disable Negative Response Caching [puppet] - 10https://gerrit.wikimedia.org/r/509456 (https://phabricator.wikimedia.org/T222937) [16:14:38] 10Operations, 10ops-codfw: pull decom hardware and ship to Harry/OIT @ SF office - https://phabricator.wikimedia.org/T222383 (10HMarcus) Hi all, Any update with this? We are looking to get some projects moving over here with this equipment. Thanks, Harry [16:16:41] (03PS3) 10Ema: ATS: disable Negative Response Caching [puppet] - 10https://gerrit.wikimedia.org/r/509456 (https://phabricator.wikimedia.org/T222937) [16:17:44] (03PS1) 10Bstorm: rsync: add a bwlimit option for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/509458 [16:18:00] (03PS1) 10Jbond: flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509459 (https://phabricator.wikimedia.org/T144169) [16:18:07] (03CR) 10Bstorm: [C: 03+2] wiki replicas: Fix misconfiguration in the views [puppet] - 10https://gerrit.wikimedia.org/r/509148 (https://phabricator.wikimedia.org/T212972) (owner: 10Bstorm) [16:18:37] (03CR) 10Jbond: [C: 03+2] Access request: add cparle to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/509355 (https://phabricator.wikimedia.org/T222864) (owner: 10Jbond) [16:18:48] (03PS2) 10Jbond: Access request: add cparle to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/509355 (https://phabricator.wikimedia.org/T222864) [16:19:40] (03PS4) 10Ema: ATS: disable Negative Response Caching [puppet] - 10https://gerrit.wikimedia.org/r/509456 (https://phabricator.wikimedia.org/T222937) [16:21:11] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Cormac Parle - https://phabricator.wikimedia.org/T222864 (10jbond) This should be complete now p[lease allow up to 30 minutes for puppet to role out the change any issues after that please re-open the ticket [16:21:39] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Cormac Parle - https://phabricator.wikimedia.org/T222864 (10jbond) 05Open→03Resolved a:03jbond [16:22:12] (03PS5) 10Ema: ATS: disable Negative Response Caching [puppet] - 10https://gerrit.wikimedia.org/r/509456 (https://phabricator.wikimedia.org/T222937) [16:23:47] (03CR) 10Ema: [C: 03+2] ATS: disable Negative Response Caching [puppet] - 10https://gerrit.wikimedia.org/r/509456 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [16:28:34] (03PS1) 10CRusnov: Revert "Also exclude `FAILED` state from PuppetDB reports" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/509460 [16:30:15] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Revert "Also exclude `FAILED` state from PuppetDB reports" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/509460 (owner: 10CRusnov) [16:31:26] !log drop archive indices from cloudelastic [16:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:07] (03PS1) 10Elukey: role::mediawiki::canary_appserver: remove nutcracker memcached conf [puppet] - 10https://gerrit.wikimedia.org/r/509462 (https://phabricator.wikimedia.org/T214275) [16:39:53] (03PS2) 10Bstorm: rsync: add a bwlimit option for quickdatacopy 
[puppet] - 10https://gerrit.wikimedia.org/r/509458 (https://phabricator.wikimedia.org/T209527) [16:40:15] 10Operations, 10ops-codfw: Interface errors on cr1-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T222967 (10ayounsi) p:05Triage→03Normal [16:47:26] (03PS2) 10Reedy: Update size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508595 [16:47:30] (03CR) 10Reedy: [C: 03+2] Update size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508595 (owner: 10Reedy) [16:48:15] 08Warning Alert for device cr1-codfw.wikimedia.org - Inbound interface errors [16:48:38] (03Merged) 10jenkins-bot: Update size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508595 (owner: 10Reedy) [16:48:51] (03CR) 10jenkins-bot: Update size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508595 (owner: 10Reedy) [16:50:06] 10Operations, 10ops-codfw: Interface errors on cr1-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T222967 (10ayounsi) Or could be T211715#4835950 again. [16:50:07] !log reedy@deploy1001 Synchronized dblists/: Update size related dblists (duration: 00m 49s) [16:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:02] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:55:12] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational [16:55:23] (03CR) 10Jforrester: [C: 04-2] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509437 (https://phabricator.wikimedia.org/T221650) (owner: 10Michael Große) [16:58:57] (03PS2) 10BBlack: Undo "www.wikipedia.org" direct DYNA [dns] - 10https://gerrit.wikimedia.org/r/509057 (https://phabricator.wikimedia.org/T208263) [16:59:08] (03PS1) 10Jbond: flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509467 (https://phabricator.wikimedia.org/T144169) [16:59:32] (03CR) 10Dzahn: [C: 03+1] "+1 to the general idea. Didn't check the exact syntax inside the .erb, would have probably done that in puppet, but if we can compile that" [puppet] - 10https://gerrit.wikimedia.org/r/509458 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [16:59:36] (03CR) 10BBlack: [C: 03+2] Undo "www.wikipedia.org" direct DYNA [dns] - 10https://gerrit.wikimedia.org/r/509057 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [17:03:39] (03CR) 10Dzahn: [C: 04-1] "so afaict..if we want to keep separate log files for each type of event then we should keep additivity = false. If we want to unite everyt" [puppet] - 10https://gerrit.wikimedia.org/r/508657 (owner: 10Paladox) [17:05:38] (03PS6) 10BBlack: Change CNAME->DYNA TTLs from 1H to 1D [dns] - 10https://gerrit.wikimedia.org/r/507400 (https://phabricator.wikimedia.org/T208263) [17:07:40] (03CR) 10BBlack: "PS6 updates for the switch to dyna.wikimedia.org, should wait until early next week for deploy." [dns] - 10https://gerrit.wikimedia.org/r/507400 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [17:10:10] RECOVERY - Check Varnish expiry mailbox lag on cp3036 is OK: OK: expiry mailbox lag is 204640 https://wikitech.wikimedia.org/wiki/Varnish [17:11:11] (03CR) 10Dzahn: [C: 04-1] "flogger was just mentioned briefly in Gehel's talk about Java logging today." 
[puppet] - 10https://gerrit.wikimedia.org/r/463519 (https://phabricator.wikimedia.org/T200739) (owner: 10Paladox) [17:11:42] (03CR) 10Rush: [C: 03+1] "thanks man for handling this!" [puppet] - 10https://gerrit.wikimedia.org/r/509140 (https://phabricator.wikimedia.org/T222392) (owner: 10Ayounsi) [17:11:42] mutante flogger is the frontend for gerrit [17:11:50] so flogger connects to log4j2. [17:11:58] err log4j i mean [17:12:13] (03PS1) 10Bstorm: cloudstore: switch scratch mounts from labstore1003 to cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/509469 (https://phabricator.wikimedia.org/T209527) [17:12:55] paladox: yes, i saw the backend stays the same [17:12:59] (03CR) 10Paladox: "> flogger was just mentioned briefly in Gehel's talk about Java" [puppet] - 10https://gerrit.wikimedia.org/r/463519 (https://phabricator.wikimedia.org/T200739) (owner: 10Paladox) [17:13:19] paladox: amusingly the first thing in the talk was "DO NOT use log4j" haha [17:13:26] lol [17:13:30] indeed [17:13:39] i'm planning on working on log4j2 support in gerrit [17:13:45] which i already have patches for. [17:13:56] paladox: but still.. does it mean we have to change any other config if the frontend changes? [17:14:01] though as flogger is now in gerrit, i have to add support for log4j2 there first. [17:14:05] paladox: oh, sounds nice [17:14:06] (03CR) 10Gehel: "> Patch Set 15:" [puppet] - 10https://gerrit.wikimedia.org/r/463519 (https://phabricator.wikimedia.org/T200739) (owner: 10Paladox) [17:14:38] ok [17:14:49] https://gerrit-review.googlesource.com/c/gerrit/+/142811 [17:14:53] it's a big change :P [17:15:49] mutante i've already converted our log4j config to log4j2 :) [17:15:54] since i was using it to test [17:16:44] 10Operations, 10Core Platform Team Backlog, 10MediaWiki-Logging, 10Core Platform Team (PHP7 (TEC4)), and 5 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10herron) [17:17:54] paladox: sounds good but we should decide which to do first then to avoid going through it to change it again soon after? [17:18:46] mutante what do you mean by decide? [17:18:57] the log4j2 change would be in at least gerrit 3.1 [17:19:06] RECOVERY - Check Varnish expiry mailbox lag on cp3034 is OK: OK: expiry mailbox lag is 201994 https://wikitech.wikimedia.org/wiki/Varnish [17:19:06] if i manage to get log4j2 working in flogger. [17:20:14] mutante in gerrit we have to use flogger for log4j to work. [17:20:22] they switched all log functions to use it [17:22:03] paladox: 3.1 ? ok.. that will be a while. ok [17:22:08] yup [17:22:19] gerrit 2.16 has been out since November 14 [17:22:26] and gerrit 3.0 is being released next week [17:22:29] at the hackathon [17:22:33] (03CR) 10Dzahn: [C: 03+1] Gerrit: Add flogger javaopts [puppet] - 10https://gerrit.wikimedia.org/r/463519 (https://phabricator.wikimedia.org/T200739) (owner: 10Paladox) [17:23:55] (03PS1) 10Bstorm: cloudstore: switch maps mounts from labstore1003 to cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/509470 (https://phabricator.wikimedia.org/T209527) [17:28:38] (03CR) 10Bstorm: [C: 03+1] "The archiver part looks good from the WMCS side." [puppet] - 10https://gerrit.wikimedia.org/r/509426 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [17:30:28] (03CR) 10Bstorm: [C: 03+1] "WMCS script looks great."
[puppet] - 10https://gerrit.wikimedia.org/r/509459 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [17:32:42] (03PS1) 10Dzahn: rsync::quickdatacopy: add data types [puppet] - 10https://gerrit.wikimedia.org/r/509472 [17:36:35] (03PS2) 10Dzahn: rsync::quickdatacopy: add data types [puppet] - 10https://gerrit.wikimedia.org/r/509472 [17:37:07] (03CR) 10jerkins-bot: [V: 04-1] rsync::quickdatacopy: add data types [puppet] - 10https://gerrit.wikimedia.org/r/509472 (owner: 10Dzahn) [17:39:32] (03CR) 10Bstorm: [C: 04-1] "The flag doesn't have a fallback in the app. I think it will default to broken rather than true. This should be set to true rather than " [puppet] - 10https://gerrit.wikimedia.org/r/508944 (https://phabricator.wikimedia.org/T222844) (owner: 10Reedy) [17:42:55] (03PS3) 10Dzahn: rsync::quickdatacopy: add data types [puppet] - 10https://gerrit.wikimedia.org/r/509472 [17:45:04] (03CR) 10Reedy: "I note Bryan did the same in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/507351/ (so this is a dupe)" [puppet] - 10https://gerrit.wikimedia.org/r/508944 (https://phabricator.wikimedia.org/T222844) (owner: 10Reedy) [17:48:23] (03PS1) 10Dzahn: acme_chief: add Icinga notes_url [puppet] - 10https://gerrit.wikimedia.org/r/509475 [17:53:10] (03PS1) 10Jbond: flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509476 (https://phabricator.wikimedia.org/T144169) [17:53:42] (03PS1) 10Dzahn: mirrrors: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509477 [17:57:45] (03PS2) 10CRusnov: profile::netbox: stop using icinga as remote cron [puppet] - 10https://gerrit.wikimedia.org/r/509445 [17:57:55] (03CR) 10CDanis: [C: 03+1] flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509459 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [17:58:14] (03PS1) 10Dzahn: haproxy: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509478 [18:03:16] (03PS3) 10CRusnov: profile::netbox: stop using icinga as remote cron [puppet] - 10https://gerrit.wikimedia.org/r/509445 [18:06:24] (03CR) 10CRusnov: "https://puppet-compiler.wmflabs.org/compiler1002/16468/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/509445 (owner: 10CRusnov) [18:14:09] (03CR) 10Bstorm: [C: 04-1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/508944 (https://phabricator.wikimedia.org/T222844) (owner: 10Reedy) [18:15:25] bstorm_: Want me to make a patch just setting it to true, then it can be cleaned up later? [18:15:36] That way, we can re-enable account creation [18:15:38] (03CR) 10Bstorm: [C: 04-1] "I think this will default to false or broken looking at the code. We should make it true instead." [puppet] - 10https://gerrit.wikimedia.org/r/507351 (https://phabricator.wikimedia.org/T219830) (owner: 10BryanDavis) [18:15:53] Yes. I think so. 
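(Several of the patches above touch rsync::quickdatacopy — the new bwlimit option and the "add data types" change. A hedged sketch of the kind of typed interface they imply follows; the parameter names and types are guesses from the patch subjects, not the module's real code, and rsync's own --bwlimit flag takes a KB/s cap.)

  # Hypothetical signature sketch for the define after both patches land:
  define rsync::quickdatacopy (
    Stdlib::Fqdn      $source_host,
    Stdlib::Fqdn      $dest_host,
    Stdlib::Unixpath  $module_path,
    Optional[Integer] $bwlimit = undef,  # assumed: passed through as rsync --bwlimit (KB/s)
  ) {
    # ... rsync server/module plumbing elided ...
  }

  # And a usage example in the spirit of the cloudstore migration patches;
  # the destination FQDN is assumed for illustration:
  rsync::quickdatacopy { 'srv-scratch':
    source_host => 'labstore1003.eqiad.wmnet',
    dest_host   => 'cloudstore1008.wikimedia.org',
    module_path => '/srv/scratch',
    bwlimit     => 40000,
  }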
[18:16:25] I also -1'd bd8.08's patch because I stand by my assessment after reviewing more ;-) [18:17:03] Heh, that's fine :) [18:17:11] (03CR) 10CDanis: [C: 03+1] "lgtm with a nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508855 (https://phabricator.wikimedia.org/T218544) (owner: 10Jbond) [18:17:17] (03PS1) 10Reedy: Re-enable striker account creation [puppet] - 10https://gerrit.wikimedia.org/r/509482 (https://phabricator.wikimedia.org/T222844) [18:19:39] (03PS2) 10Dzahn: haproxy: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509478 (https://phabricator.wikimedia.org/T197873) [18:20:49] (03CR) 10Dzahn: [C: 03+2] haproxy: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509478 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [18:22:15] 10Operations, 10Traffic, 10Performance-Team (Radar): Support brotli compression - https://phabricator.wikimedia.org/T137979 (10ori) >>! In T137979#4118215, @BBlack wrote: > Re-reading above: probably the better blend of options would be to swap gzip for brotli in Varnish one-for-one (without the whole storin... [18:26:35] (03PS1) 10Jbond: flake8: puppetmaster - Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509484 (https://phabricator.wikimedia.org/T144169) [18:27:25] (03CR) 10jerkins-bot: [V: 04-1] flake8: puppetmaster - Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509484 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [18:28:19] (03PS2) 10Jbond: flake8: puppetmaster - Add python extension so scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509484 (https://phabricator.wikimedia.org/T144169) [18:29:29] (03CR) 10Jbond: [C: 03+2] flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509459 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [18:29:38] (03PS2) 10Jbond: flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509459 (https://phabricator.wikimedia.org/T144169) [18:35:12] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:35:22] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:35:43] jbond42: ^ [18:35:50] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:36:00] is that one of the scripts changed? [18:36:08] PROBLEM - puppet last run on dbproxy1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:36:48] PROBLEM - puppet last run on cloudelastic1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:37:22] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:37:26] PROBLEM - puppet last run on cloudvirt1019 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:37:32] PROBLEM - puppet last run on ms-fe1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:37:41] I'm going to disable puppet across the fleet [18:37:44] PROBLEM - puppet last run on mw2245 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:37:52] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:38:14] what happen [18:38:16] PROBLEM - puppet last run on cp5012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-vhtcpd-stats] [18:38:20] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:38:23] Oh I see the py change? [18:39:00] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:39:22] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin '*' 'disable-puppet "Puppet breakages on all hosts -- cdanis"' [18:39:22] PROBLEM - puppet last run on elastic2033 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats],File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:48] PROBLEM - puppet last run on elastic1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:39:51] cdanis: yes looks like it [18:40:00] May 10 18:32:04 mw1233 puppet-agent[36781]: (/Stage[main]/Prometheus::Node_puppet_agent/File[/usr/local/bin/prometheus-puppet-agent-stats]) Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/prometheus/usr/local/bin/prometheus-puppet-agent-stats [18:40:04] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:40:14] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:40:46] not sure what is referencing the old filename [18:40:59] (03PS2) 10Dzahn: mirrrors: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509477 [18:41:02] looking [18:41:15] (03PS3) 10Dzahn: mirrrors: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509477 (https://phabricator.wikimedia.org/T197873) [18:41:44] (03PS4) 10Dzahn: mirrors: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509477 (https://phabricator.wikimedia.org/T197873) [18:42:28] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [18:42:30] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [18:45:58] cdanis: looks good now [18:46:18] i think this could be caused by the issue mentioned by alex on the puppet failures task [18:46:23] one sec let me get the link [18:46:32] just, inconsistent puppet file serving? [18:46:35] oh the overlapping something something [18:47:06] https://phabricator.wikimedia.org/T221529#5159469 [18:47:17] cdanis: yes [18:48:03] chaomodus: also yes :) [18:48:11] :) [18:48:43] we should make puppet-merge log its invocations somewhere heh [18:48:47] ok I will re-enable puppet everywhere [18:48:58] great thanks [18:49:18] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin '*' 'enable-puppet "Puppet breakages on all hosts -- cdanis"' [18:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:23] seems like a relatively easy change to log puppet-merge :) [18:49:28] and as to puppet-merge yes at the moment it doesn't log anywhere, i think it should be easy enough to update it to log here [18:49:32] now- making an event in grafana [18:49:34] (03PS1) 10Dzahn: udp2log: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509489 (https://phabricator.wikimedia.org/T197873) [18:49:45] I think in the past we had races even without multiple masters involved (in other words, I don't think the merge of the last git update, even on a single master, is atomic... having all the file resources match the new manifest can take a while) [18:50:28] bblack: yes that's the issue we just hit [18:50:32] PROBLEM - puppet last run on elastic1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:50:43] (03CR) 10Dzahn: "udp2log is obsolete but somehow also still used, note how wikitech redirects to https://wikitech.wikimedia.org/wiki/Obsolete:Squid_logging" [puppet] - 10https://gerrit.wikimedia.org/r/509489 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [18:50:50] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:51:00] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:51:06] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:51:06] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:51:18] PROBLEM - puppet last run on elastic2036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 seconds ago with 1 failures. Failed resources (up to 3 shown) [18:51:18] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:51:18] PROBLEM - puppet last run on elastic2050 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:51:20] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -b 15 -p 95 '*' 'run-puppet-agent -q --failed-only' [18:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:28] it's possible to work around this problem by breaking up commits into multiple stages, but it's a huge PITA and I normally wouldn't [18:51:30] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:51:44] PROBLEM - puppet last run on elastic1052 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:51:50] (e.g. break out a change to add some new file resources which are unused, then later deploy the manifest commit that uses them) [18:51:56] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:52:03] PROBLEM - puppet last run on elastic2042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 0 seconds ago with 1 failures. Failed resources (up to 3 shown) [18:52:06] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:52:10] yeah it would have been extra annoying for this commit (and muddied the git history) [18:52:14] (03PS2) 10Dzahn: udp2log: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509489 (https://phabricator.wikimedia.org/T197873) [18:52:16] PROBLEM - puppet last run on elastic2046 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:52:20] PROBLEM - puppet last run on dbproxy1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:52:38] i'm surprised it's still failing now [18:52:56] PROBLEM - puppet last run on cloudelastic1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:53:00] PROBLEM - puppet last run on elastic1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:53:04] the '1 minute ago' is concerning [18:53:04] PROBLEM - puppet last run on elastic2026 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 45 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:53:30] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:53:34] RECOVERY - puppet last run on cloudvirt1019 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [18:53:40] PROBLEM - puppet last run on ms-fe1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:53:48] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:53:52] PROBLEM - puppet last run on mw2245 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:53:54] (03PS3) 10Dzahn: udp2log: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509489 (https://phabricator.wikimedia.org/T197873) [18:54:02] PROBLEM - puppet last run on elastic1041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:54:02] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:54:03] looks like i missed a file reference [18:54:15] 10Puppet, 10Patch-For-Review: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104 (10Dzahn) https://gerrit.wikimedia.org/r/c/operations/puppet/+/509489 [18:54:24] PROBLEM - puppet last run on cp5012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-vhtcpd-stats] [18:54:30] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:54:34] PROBLEM - puppet last run on elastic1029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 seconds ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:54:39] 10Puppet, 10Analytics: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104 (10Dzahn) [18:54:40] (03PS1) 10Jbond: flake8: fix file refrence [puppet] - 10https://gerrit.wikimedia.org/r/509491 [18:54:53] cdanis: can you review && [18:55:02] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 51 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:55:04] PROBLEM - puppet last run on elastic2051 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:55:10] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [18:55:15] (03CR) 10CDanis: [C: 03+1] flake8: fix file refrence [puppet] - 10https://gerrit.wikimedia.org/r/509491 (owner: 10Jbond) [18:55:18] 10Puppet, 10Analytics: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104 (10Dzahn) It's been a couple years and this has been called obsolete for a long time but also it's not completely removed yet. [18:55:26] PROBLEM - puppet last run on elastic2043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:55:26] PROBLEM - puppet last run on elastic2045 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 21 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:55:30] PROBLEM - puppet last run on elastic2033 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:55:30] could be slow runs + slow reporting, etc [18:55:30] I donno [18:55:38] no, it is an error [18:55:41] (03CR) 10Jbond: [C: 03+2] flake8: fix file refrence [puppet] - 10https://gerrit.wikimedia.org/r/509491 (owner: 10Jbond) [18:56:16] sorry about this i really did think this would be a harmless change [18:56:34] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10observability: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10Dzahn) [18:56:40] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:56:48] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10observability: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10Dzahn) also T152104 [18:56:49] no worries jbond42! [18:56:50] heh I was very lagged in sending that, obviously :) [18:57:06] (03PS4) 10Dzahn: udp2log: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509489 (https://phabricator.wikimedia.org/T197873) [18:57:25] the two lessons here IMO are 1) puppet-merge has consistency issues we should figure out how to fix and 2) our puppet CI is ... 
sub-optimal [18:57:26] (my IRC froze and then later it sent what I wrote way back in chat history) [18:57:30] waits before merging other stuff [18:57:40] RECOVERY - puppet last run on dbproxy1011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:57:41] "reference to a puppet:// path that doesn't exist" is something we should be catching [18:57:50] PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:58:09] yeah [18:58:16] RECOVERY - puppet last run on cloudelastic1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:58:19] maybe that in particular, we can find a way to catch it [18:58:32] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -b 15 -p 95 '*' 'run-puppet-agent -q --failed-only' [18:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:45] but in the general case, it's known that even the puppet-compiler can't actually catch all problems that can break the agent run in the real world [18:58:50] cdanis: i think 2 is relatively easy to improve, 1 is a bit trickier i think [18:58:55] yeah agreed [18:59:00] RECOVERY - puppet last run on ms-fe1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:59:06] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [18:59:12] RECOVERY - puppet last run on mw2245 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:59:18] ok things seem to be recovering [18:59:22] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:59:22] PROBLEM - puppet last run on elastic2054 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [18:59:23] I have yet to imagine a way to fix 1) that doesn't involve some tricksiness or changes to puppet serving [18:59:38] Hello [18:59:46] RECOVERY - puppet last run on cp5012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:59:52] (1) is the same problem almost every system that involves routinely deploying large code+data changes has [19:00:10] it's not that dissimilar from all our BS with train deploys and scap and opcaches and... [19:00:12] if you design for it up-front it isn't as bad 🙃 [19:00:30] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:00:42] again sorry for the noise, looks like things are recovering now, i promise to push no more till monday :) [19:00:46] deploying more than 1-bit of change in an atomic fashion is difficult (well yeah, assuming you didn't design for it, which almost nobody does) [19:00:46] Elastic nodes are not recovering [19:00:48] RECOVERY - puppet last run on elastic2043 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [19:00:48] RECOVERY - puppet last run on elastic2045 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [19:00:53] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Puppet has 1 failures.
Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-wmf-elasticsearch-exporter] [19:00:55] onimisionipe: they should be now, there was one file missed [19:00:59] onimisionipe: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/509491/ [19:01:00] Ok [19:01:16] RECOVERY - puppet last run on elastic1039 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:01:21] and I'm halfway done with a cumin invocation to rerun puppet everywhere it the last run failed [19:01:26] s/it the/the/ [19:01:32] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:01:42] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:01:48] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [19:01:48] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:02:00] RECOVERY - puppet last run on elastic2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:02:01] Alright cool [19:02:02] RECOVERY - puppet last run on elastic2050 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:02:12] Thanks! [19:02:12] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:02:21] out of curiosity onimisionipe did this change break elasticsearch indexing as well? I see the alerts about Mediawiki Cirrussearch update rate [19:02:28] RECOVERY - puppet last run on elastic1052 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:02:48] RECOVERY - puppet last run on elastic2042 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [19:02:58] (or more likely, it just broke the Prometheus reporting thereof...?) [19:03:00] RECOVERY - puppet last run on elastic2046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:03:12] RECOVERY - puppet last run on elastic1051 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:03:44] RECOVERY - puppet last run on elastic1035 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:03:47] cdanis: nah...it's not related. The change is related to one of our exporters [19:03:48] RECOVERY - puppet last run on elastic2026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:04:03] (03PS1) 10Dzahn: planet: test if reviwer-bot is broken [puppet] - 10https://gerrit.wikimedia.org/r/509492 [19:04:12] RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:04:24] ok! [19:04:30] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [19:04:31] afk a bit, need some tea and a snack [19:04:38] cumin run is complete btw [19:04:42] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:04:42] RECOVERY - puppet last run on elastic2054 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:04:50] Ok. 
Cool [19:04:52] ok cool im out have a nice weekend [19:05:06] * jbond42 will keep an eye on things for ten mins just in case [19:05:10] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [19:05:16] RECOVERY - puppet last run on elastic1029 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:05:46] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:05:48] RECOVERY - puppet last run on elastic2051 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:06:12] RECOVERY - puppet last run on elastic2033 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:06:14] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:07:10] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:08:12] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:15:32] (03PS1) 10AndyRussG: [DO NOT MERGE] Test for reviewer-bot [puppet] - 10https://gerrit.wikimedia.org/r/509493 [19:15:38] (03CR) 10jerkins-bot: [V: 04-1] [DO NOT MERGE] Test for reviewer-bot [puppet] - 10https://gerrit.wikimedia.org/r/509493 (owner: 10AndyRussG) [19:17:28] (03Abandoned) 10AndyRussG: [DO NOT MERGE] Test for reviewer-bot [puppet] - 10https://gerrit.wikimedia.org/r/509493 (owner: 10AndyRussG) [19:18:30] (03PS1) 10AndyRussG: [DO NOT MERGE] Test for reviewer-bot [puppet] - 10https://gerrit.wikimedia.org/r/509494 [19:18:48] (03Abandoned) 10Dzahn: planet: test if reviwer-bot is broken [puppet] - 10https://gerrit.wikimedia.org/r/509492 (owner: 10Dzahn) [19:30:59] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
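To make the failure class jbond and cdanis discuss above concrete — a hypothetical Puppet sketch, with invented module paths, not the actual profile code from this incident: a file resource whose puppet:/// source is absent from the module still compiles into a catalog, so compile-only CI passes, and the breakage only surfaces when every agent tries to apply it.

    # Hypothetical illustration only; module and file paths are made up.
    file { '/usr/local/bin/prometheus-wmf-elasticsearch-exporter':
      ensure => file,
      owner  => 'root',
      group  => 'root',
      mode   => '0755',
      # If the referenced file is missing under the module's files/ directory,
      # compilation still succeeds; each agent run then fails this resource
      # with "Could not evaluate: Could not retrieve file metadata".
      source => 'puppet:///modules/prometheus/usr/local/bin/prometheus-wmf-elasticsearch-exporter',
    }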
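For readers unfamiliar with cumin, a hedged gloss of the recovery command cdanis logged above (flag meanings per cumin's CLI; output elided):

    # '*' targets every host cumin knows about; -b 15 works in batches of
    # 15 hosts; -p 95 tolerates up to 5% of hosts failing before aborting.
    # run-puppet-agent -q --failed-only does a quiet agent run, and only on
    # hosts whose previous puppet run failed.
    sudo cumin -b 15 -p 95 '*' 'run-puppet-agent -q --failed-only'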
[19:39:11] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 371.84 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [19:39:15] PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 374.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [19:39:29] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:41:53] RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 0.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [19:41:57] RECOVERY - MariaDB Slave Lag: m3 on db2042 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [19:47:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:55:53] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:11:23] (03PS1) 10Bstorm: toolforge: remove memory overcommit limitations on cron host [puppet] - 10https://gerrit.wikimedia.org/r/509511 (https://phabricator.wikimedia.org/T222255) [20:13:41] PROBLEM - MariaDB Slave Lag: s8 on db1116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 815.89 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [20:17:47] (03Abandoned) 10AndyRussG: [DO NOT MERGE] Test for reviewer-bot [puppet] - 10https://gerrit.wikimedia.org/r/509494 (owner: 10AndyRussG) [20:31:00] (03PS2) 10Bstorm: toolforge: remove memory overcommit limitations on cron host [puppet] - 10https://gerrit.wikimedia.org/r/509511 (https://phabricator.wikimedia.org/T222255) [20:45:15] 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10Aklapper) [20:52:24] (03CR) 10Bstorm: "This appears to do the thing without breaking anything else https://puppet-compiler.wmflabs.org/compiler1002/16470/" [puppet] - 10https://gerrit.wikimedia.org/r/509511 (https://phabricator.wikimedia.org/T222255) (owner: 10Bstorm) [20:58:16] (03CR) 10Bstorm: [C: 03+2] toolforge: remove memory overcommit limitations on cron host [puppet] - 10https://gerrit.wikimedia.org/r/509511 (https://phabricator.wikimedia.org/T222255) (owner: 10Bstorm) [21:04:40] (03CR) 10Dzahn: [C: 03+2] udp2log: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509489 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [21:06:03] (03PS5) 10Dzahn: udp2log: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509489 (https://phabricator.wikimedia.org/T197873) [21:24:12] (03PS5) 10Dzahn: mirrors: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509477 (https://phabricator.wikimedia.org/T197873) [21:28:38] (03PS1) 10Dzahn: burrow: add Icinga notes_url [puppet] - 10https://gerrit.wikimedia.org/r/509532 
(https://phabricator.wikimedia.org/T197873) [21:32:34] (03CR) 10Dzahn: "@ottomata This is good to go, we did the same for a bunch of other repos, just that either the checked out dir has to be deleted for a mom" [puppet] - 10https://gerrit.wikimedia.org/r/507077 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [21:33:25] mutante mom? [21:33:54] oh moment [21:34:26] paladox: i think it would help to point out on all these changes that people have to use sed or delete it [21:34:46] i see many have been merged meanwhile :) [21:34:51] yup :) [21:35:26] sed -i -e 's/\/r\/p/\/r\/' [21:35:34] * sed -i -e 's/\/r\/p/\/r\//' [21:36:11] um, * sed -i -e 's/\/r\/p/\/r/' [21:36:39] heh :) i was lazy and just did rm and ran puppet [21:36:52] both work [21:46:48] 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10Aklapper) 05Open→03Resolved a:05Aklapper→03... [21:49:52] ahah [21:49:53] lol [21:52:06] 10Operations, 10ops-eqiad, 10Gerrit, 10serviceops, 10Release-Engineering-Team (Watching / External): Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (10Dzahn) created S4 procurement ticket for this at T222984 [21:55:32] 10Operations, 10netops: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 - https://phabricator.wikimedia.org/T207753 (10ayounsi) [22:04:56] @seen Ladsgroup [22:04:56] mutante: I have never seen Ladsgroup [22:05:09] mutante Amir1 [22:05:09] :) [22:05:12] Amir1: :) hi, how do you feel about merging the shorturl dumps change [22:05:19] lgtm [22:09:47] 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10Aklapper) Followup note: [Milestones (which are NO... [22:12:21] running out of battery, bbiaw [22:13:53] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/16471/" [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [22:26:19] RECOVERY - MariaDB Slave Lag: s8 on db1116 is OK: OK slave_sql_lag Replication lag: 0.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [23:08:49] PROBLEM - puppet last run on wdqs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:29:11] (03PS1) 10Paladox: Gerrit: Convert log4j.xml to log4j2.xml [puppet] - 10https://gerrit.wikimedia.org/r/509542 [23:29:19] PROBLEM - Check systemd state on ms-be2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:32:16] (03PS2) 10Paladox: Gerrit: Convert log4j.xml to log4j2.xml [puppet] - 10https://gerrit.wikimedia.org/r/509542 [23:35:41] RECOVERY - puppet last run on wdqs1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:38:55] RECOVERY - Check systemd state on ms-be2014 is OK: OK - running: The system is fully operational [23:40:18] (03CR) 10Dzahn: [C: 03+2] burrow: add Icinga notes_url [puppet] - 10https://gerrit.wikimedia.org/r/509532 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn)
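Spelled out, mutante's final sed (the third attempt above) rewrites the old /r/p/ Gerrit URL prefix to /r/ in place, as the alternative to deleting the checkout and letting puppet re-clone it ("both work"). A usage sketch — the target file (.git/config of an affected checkout) and the full-URL form of the pattern are assumptions, not from the log:

    # Assumed usage: fix the remote URL of a repo cloned from the old
    # gerrit.wikimedia.org/r/p/ prefix; the '|' delimiter just avoids
    # the slash-escaping seen in the chat versions.
    sed -i -e 's|https://gerrit.wikimedia.org/r/p/|https://gerrit.wikimedia.org/r/|' .git/config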
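On the log4j.xml → log4j2.xml conversion in change 509542: log4j2 uses a different XML schema than log4j 1.x (Configuration/Appenders/Loggers elements instead of log4j:configuration and appender class attributes). A minimal, hypothetical log4j2-style file for orientation — the file names and layout pattern are invented, and the real patch contents are not reproduced here:

    <!-- Hypothetical log4j2 sketch; not the actual gerrit config. -->
    <Configuration>
      <Appenders>
        <RollingFile name="gerrit" fileName="/var/log/gerrit/gerrit.log"
                     filePattern="/var/log/gerrit/gerrit.log.%d{yyyy-MM-dd}">
          <PatternLayout pattern="[%d] %-5p %c: %m%n"/>
          <Policies>
            <TimeBasedTriggeringPolicy/>
          </Policies>
        </RollingFile>
      </Appenders>
      <Loggers>
        <Root level="info">
          <AppenderRef ref="gerrit"/>
        </Root>
      </Loggers>
    </Configuration>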