[00:46:53] Operations, Performance-Team, Traffic, Wikimedia-General-or-Unknown, SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (Imarlier) I'm going to start with the simple case, I'll make it more complicated...
[00:54:02] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:55:37] ^ me
[00:56:22] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[00:57:29] !log restarted postgresql on netmon2001 as part of debugging issue with failed icinga replication check
[00:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:32] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:25:02] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:25:11] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=GET https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:25:41] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:26:32] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={GET,LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:26:52] PROBLEM - puppet last run on db1078 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:27:21] PROBLEM - puppet last run on mw1327 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:27:31] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={GET,LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:27:31] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:27:32] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={GET,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:30:12] PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:30:20] I got some weird intermittent error at https://phabricator.wikimedia.org/T56859
[01:30:23] On refresh, it went away.
[01:31:41] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:34:51] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:35:01] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:35:12] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:35:31] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:36:01] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:36:51] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:39:02] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:39:31] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:52:42] RECOVERY - puppet last run on db1078 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:53:12] RECOVERY - puppet last run on mw1327 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[01:53:32] PROBLEM - haproxy failover on dbproxy1006 is CRITICAL: CRITICAL check_failover servers up 2 down 1
[01:54:31] PROBLEM - haproxy failover on dbproxy1001 is CRITICAL: CRITICAL check_failover servers up 2 down 1
[01:55:51] PROBLEM - puppet last run on db1087 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:56:52] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:57:21] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={GET,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:57:32] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={GET,LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:57:52] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:58:32] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:58:32] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=GET https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:00:52] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:00:52] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:01:11] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:02:02] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:02:32] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:02:51] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:09:41] PROBLEM - puppet last run on etcd1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:10:21] PROBLEM - Juniper alarms on asw2-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.21
[02:10:22] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:10:52] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP]
[02:10:52] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:11:11] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:11:32] PROBLEM - puppet last run on elastic1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:11:32] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={GET,LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:11:41] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:11:41] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={GET,LIST} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:11:52] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={GET,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:12:51] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:12:52] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:12:52] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:13:11] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:13:22] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:13:32] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:14:22] PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:15:01] RECOVERY - Juniper alarms on asw2-a-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[02:21:51] RECOVERY - puppet last run on db1087 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[02:29:52] RECOVERY - HP RAID on ms-be1036 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK
[02:36:42] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[02:37:22] RECOVERY - puppet last run on elastic1035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:40:21] RECOVERY - puppet last run on mw1267 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:40:42] RECOVERY - puppet last run on etcd1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[02:41:22] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[03:29:01] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 872.77 seconds
[03:40:51] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 276.45 seconds
[03:58:02] PROBLEM - Memory correctable errors -EDAC- on db1069 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops
[04:40:51] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:56:56] Operations, ops-codfw, DBA: db2061 disk with predictive failure - https://phabricator.wikimedia.org/T200059 (Marostegui) All good! ``` root@db2061:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337F3720) Port Name: 1I Port Name: 2I Gen8 Serv...
[04:57:03] Operations, ops-codfw, DBA: db2061 disk with predictive failure - https://phabricator.wikimedia.org/T200059 (Marostegui) Open>Resolved
[05:05:02] PROBLEM - Host es1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[05:05:11] RECOVERY - haproxy failover on dbproxy1006 is OK: OK check_failover servers up 2 down 0
[05:06:02] RECOVERY - haproxy failover on dbproxy1001 is OK: OK check_failover servers up 2 down 0
[05:15:16] Operations, ops-eqiad, DBA: es1019 mgmt interface DOWN - https://phabricator.wikimedia.org/T201132 (Marostegui)
[05:15:24] Operations, ops-eqiad, DBA: es1019 mgmt interface DOWN - https://phabricator.wikimedia.org/T201132 (Marostegui) p:Triage>Normal
[05:22:27] Operations, ops-eqiad, DBA: db1069 (x1 master) memory errors - https://phabricator.wikimedia.org/T201133 (Marostegui) Open>stalled p:Triage>Normal
[05:23:13] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on db1069 is CRITICAL: 4 ge 4 Marostegui T201133 - The acknowledgement expires at: 2018-09-11 05:22:54. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops
[05:25:24] Operations, Scap: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (mmodell) @akosiaris: Agreed, it sounds like a good idea to verify the umask in scap.
[06:21:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0
[06:28:37] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/spark2_yarn_shuffle_jar_install]
[06:31:48] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/sudoers]
[06:52:57] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-upload site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[06:53:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:53:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:54:17] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[06:55:16] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:55:49] one single spike afaics, seems a temp issue with codfw varnish backends
[06:56:07] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[06:56:17] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:56:36] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[06:56:47] still network issues then?
[06:56:51] https://grafana.wikimedia.org/dashboard/db/network-performances-global?panelId=20&fullscreen&orgId=1
[06:57:28] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:43] Operations, vm-requests, Patch-For-Review, User-herron: eqiad: (1) VM request for Archiva - https://phabricator.wikimedia.org/T200895 (elukey) Thanks!!!
[06:57:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:59:37] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:02:36] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[07:03:16] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[07:03:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:04:53] according to varnishospital https://logstash.wikimedia.org/goto/d3f2984edd7020eb57a1db098dd4ef7a the spike affected most of {text,misc}_{eqiad,codfw}: https://phabricator.wikimedia.org/P7425
[07:08:27] (CR) Jcrespo: "I don't have a strong feeling about this, but if it is fixed, shouldn't it be put inside a tendril/dbmonitor::common profile, rather than " [puppet] - https://gerrit.wikimedia.org/r/449350 (owner: Dzahn)
[07:20:00] Operations, Analytics, Analytics-Kanban, netops, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (elukey) Some https calls still registered, I tried to add more https_proxy settings and I opened T201134 to follow up with the author o...
[07:48:42] (PS2) Elukey: Release 2.2.3-1 [debs/archiva] (debian) - https://gerrit.wikimedia.org/r/450081 (https://phabricator.wikimedia.org/T192639)
[07:50:17] (CR) Elukey: "Also added a README :)" [debs/archiva] (debian) - https://gerrit.wikimedia.org/r/450081 (https://phabricator.wikimedia.org/T192639) (owner: Elukey)
[07:59:04] (CR) Elukey: [C: 2] Release 2.2.3-1 [debs/archiva] (debian) - https://gerrit.wikimedia.org/r/450081 (https://phabricator.wikimedia.org/T192639) (owner: Elukey)
[08:09:09] (PS1) Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/450193
[08:10:39] (CR) Filippo Giunchedi: [C: 1] "LGTM, multi line records is the only case I can think where it would make sense to not have OPS, though none or json/syslog/udp2log is usi" [puppet] - https://gerrit.wikimedia.org/r/449913 (https://phabricator.wikimedia.org/T200960) (owner: BBlack)
[08:11:13] (CR) Filippo Giunchedi: [C: 2] logstash: default to 4MB receive buffer [puppet] - https://gerrit.wikimedia.org/r/450028 (https://phabricator.wikimedia.org/T200960) (owner: Filippo Giunchedi)
[08:11:26] (PS3) Filippo Giunchedi: logstash: default to 4MB receive buffer [puppet] - https://gerrit.wikimedia.org/r/450028 (https://phabricator.wikimedia.org/T200960)
[08:11:30] (CR) Filippo Giunchedi: [C: 2] logstash: remove multiline filter [puppet] - https://gerrit.wikimedia.org/r/450026 (https://phabricator.wikimedia.org/T200960) (owner: Filippo Giunchedi)
[08:11:38] (PS3) Filippo Giunchedi: logstash: remove multiline filter [puppet] - https://gerrit.wikimedia.org/r/450026 (https://phabricator.wikimedia.org/T200960)
[08:11:43] (CR) Filippo Giunchedi: [C: 2] logstash: use default number of queue workers [puppet] - https://gerrit.wikimedia.org/r/450027 (https://phabricator.wikimedia.org/T200960) (owner: Filippo Giunchedi)
[08:11:57] (PS4) Filippo Giunchedi: logstash: use default number of queue workers [puppet] - https://gerrit.wikimedia.org/r/450027 (https://phabricator.wikimedia.org/T200960)
[08:14:02] (PS4) Filippo Giunchedi: logstash: remove multiline filter [puppet] - https://gerrit.wikimedia.org/r/450026 (https://phabricator.wikimedia.org/T200960)
[08:14:04] (CR) Filippo Giunchedi: [V: 2 C: 2] logstash: remove multiline filter [puppet] - https://gerrit.wikimedia.org/r/450026 (https://phabricator.wikimedia.org/T200960) (owner: Filippo Giunchedi)
[08:14:22] (PS5) Filippo Giunchedi: logstash: use default number of queue workers [puppet] - https://gerrit.wikimedia.org/r/450027 (https://phabricator.wikimedia.org/T200960)
[08:14:25] (CR) Filippo Giunchedi: [V: 2 C: 2] logstash: use default number of queue workers [puppet] - https://gerrit.wikimedia.org/r/450027 (https://phabricator.wikimedia.org/T200960) (owner: Filippo Giunchedi)
[08:16:59] <_joe_> godog: isn't the multiline filter used for java stacktraces?
[08:17:12] <_joe_> maybe we don't use it
[08:17:33] _joe_: no, java goes through gelf iirc and it already knows about multi line
[08:17:49] gelf or log4j or what the name is nowadays
[08:17:54] <_joe_> yeah if they use gelf that's ok
[08:18:03] <_joe_> log4j is the library, gelf is the format
[08:19:34] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/450193 (owner: Marostegui)
[08:19:45] ah and logback
[08:19:50] which is also a library
[08:20:53] (Merged) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/450193 (owner: Marostegui)
[08:21:46] (CR) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/450193 (owner: Marostegui)
[08:23:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1078 (duration: 00m 50s)
[08:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:19] !log filippo@neodymium conftool action : set/pooled=no; selector: name=logstash1007.eqiad.wmnet
[08:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:47] (PS1) Filippo Giunchedi: logstash: default workers to processorcount [puppet] - https://gerrit.wikimedia.org/r/450195 (https://phabricator.wikimedia.org/T200960)
[08:25:55] (CR) Filippo Giunchedi: [C: 2] logstash: default workers to processorcount [puppet] - https://gerrit.wikimedia.org/r/450195 (https://phabricator.wikimedia.org/T200960) (owner: Filippo Giunchedi)
[08:28:46] Operations, netops: Intermitent connectivity issues between eqiad servers? - https://phabricator.wikimedia.org/T201139 (jcrespo)
[08:31:16] Operations, netops: Intermitent connectivity issues between eqiad servers? - https://phabricator.wikimedia.org/T201139 (jcrespo)
[08:31:52] Operations, Core-Platform-Team, HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Puppetize the installation of PHP-FPM on the MediaWiki hosts - https://phabricator.wikimedia.org/T201140 (Joe) p:Triage>Normal
[08:47:40] Operations, netops: Intermitent connectivity issues between eqiad servers? - https://phabricator.wikimedia.org/T201139 (Joe)
[08:48:44] (PS1) Jcrespo: mariadb: Introduce a read_only check for multiinstance hosts [puppet] - https://gerrit.wikimedia.org/r/450199 (https://phabricator.wikimedia.org/T172489)
[08:49:23] (CR) jerkins-bot: [V: -1] mariadb: Introduce a read_only check for multiinstance hosts [puppet] - https://gerrit.wikimedia.org/r/450199 (https://phabricator.wikimedia.org/T172489) (owner: Jcrespo)
[08:53:55] (PS2) Jcrespo: mariadb: Introduce a read_only check for multiinstance hosts [puppet] - https://gerrit.wikimedia.org/r/450199 (https://phabricator.wikimedia.org/T172489)
[08:54:27] (CR) jerkins-bot: [V: -1] mariadb: Introduce a read_only check for multiinstance hosts [puppet] - https://gerrit.wikimedia.org/r/450199 (https://phabricator.wikimedia.org/T172489) (owner: Jcrespo)
[08:56:20] (PS3) Jcrespo: mariadb: Introduce a read_only check for multiinstance hosts [puppet] - https://gerrit.wikimedia.org/r/450199 (https://phabricator.wikimedia.org/T172489)
[08:57:37] (CR) Jcrespo: "This should be deployed after https://gerrit.wikimedia.org/r/450046" [puppet] - https://gerrit.wikimedia.org/r/450199 (https://phabricator.wikimedia.org/T172489) (owner: Jcrespo)
[08:58:09] (CR) Marostegui: "Nice!" [puppet] - https://gerrit.wikimedia.org/r/450199 (https://phabricator.wikimedia.org/T172489) (owner: Jcrespo)
[09:04:37] Operations, netops: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (faidon) p:Triage>High
[09:10:03] Operations, ops-eqiad: dbproxy1006 iDRAC IP conflict - https://phabricator.wikimedia.org/T201148 (faidon) p:Triage>High
[09:15:03] Operations, netops: cr1/2-eqiad PFE_FW_SYSLOG_IP6_GEN log entries - https://phabricator.wikimedia.org/T201149 (faidon) p:Triage>High
[09:24:36] (PS2) Gehel: elasticsearch: migrate relforge to stretch [puppet] - https://gerrit.wikimedia.org/r/450060 (https://phabricator.wikimedia.org/T193649)
[09:25:49] (CR) Gehel: [C: 2] elasticsearch: migrate relforge to stretch [puppet] - https://gerrit.wikimedia.org/r/450060 (https://phabricator.wikimedia.org/T193649) (owner: Gehel)
[09:30:11] (PS1) Alexandros Kosiaris: Allow the deploy user to get pod logs [deployment-charts] - https://gerrit.wikimedia.org/r/450201 (https://phabricator.wikimedia.org/T199489)
[09:31:36] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational
[09:32:07] (PS2) Alexandros Kosiaris: Allow the deploy user to get pod logs [deployment-charts] - https://gerrit.wikimedia.org/r/450201 (https://phabricator.wikimedia.org/T199489)
[09:35:25] Operations, ops-eqiad, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (faidon) I'm investigating unrelated issues in asw2-b-eqiad and these ports are flapping (probably boot-looping into PXE), so I dis...
[09:35:38] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685 (faidon) I'm investigating unrelated issues in asw2-b-eqiad and this port is flapping (probably boot-looping into PXE), so I disabled it. @Ro...
[09:36:06] (PS3) Elukey: Remove geowiki cron jobs and make puppet delete related files/dirs [puppet] - https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: Fdans)
[09:36:23] !log disabling puppet on mariadb multiinstance hosts
[09:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:30] !log asw2-b-eqiad: set disable for xe-2/0/5 (cloudvirt1023), xe-7/0/22 (cloudvirt1024), ge-8/0/23 (rdb1009)
[09:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:38] (PS4) Elukey: Remove geowiki cron jobs and make puppet delete related files/dirs [puppet] - https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: Fdans)
[09:40:16] (CR) Giuseppe Lavagetto: Allow the deploy user to get pod logs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/450201 (https://phabricator.wikimedia.org/T199489) (owner: Alexandros Kosiaris)
[09:40:49] (CR) Jcrespo: [C: 2] Upgrade check_mariadb.py to the latest WMFMariaDB version [puppet] - https://gerrit.wikimedia.org/r/450046 (owner: Jcrespo)
[09:40:58] (PS3) Jcrespo: Upgrade check_mariadb.py to the latest WMFMariaDB version [puppet] - https://gerrit.wikimedia.org/r/450046
[09:41:47] (CR) Jcrespo: [C: 2] mariadb: Introduce a read_only check for multiinstance hosts [puppet] - https://gerrit.wikimedia.org/r/450199 (https://phabricator.wikimedia.org/T172489) (owner: Jcrespo)
[09:41:55] (PS4) Jcrespo: mariadb: Introduce a read_only check for multiinstance hosts [puppet] - https://gerrit.wikimedia.org/r/450199 (https://phabricator.wikimedia.org/T172489)
[09:49:49] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/450202
[09:50:20] !log testing new monitoring on dbstore2001
[09:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:01] (PS1) Ema: 7.1.3+ds-4wm2: do not start the service on install or upgrade [debs/trafficserver] - https://gerrit.wikimedia.org/r/450203 (https://phabricator.wikimedia.org/T200178)
[09:52:19] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:53:22] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/450202 (owner: Marostegui)
[09:53:28] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:54:29] Operations, Core-Platform-Team, HHVM, TechCom-RFC (TechCom-Approved), and 2 others: Puppetize the installation of PHP-FPM on the MediaWiki hosts - https://phabricator.wikimedia.org/T201140 (Joe) a:Joe
[09:54:37] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/450202 (owner: Marostegui)
[09:55:08] (PS16) Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717)
[09:55:25] (CR) Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library (7 comments) [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: Vgutierrez)
[09:56:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 (duration: 00m 49s)
[09:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:20] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/450202 (owner: Marostegui)
[09:58:25] Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (mobrovac)
[09:59:23] Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (mobrovac) Thank you @Pchelolo ! Before we can merge and deploy the above PR, though, we need to be able to monitor the service in producti...
[10:03:39] PROBLEM - MariaDB read only s7 on dbstore2001 is CRITICAL: Could not connect to localhost:3317
[10:03:39] PROBLEM - MariaDB read only s5 on dbstore2001 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.33-MariaDB, Uptime 6813347s, 5 mysqld process(es), s5 lag: 0.17s, 12 client(s), 31.62 QPS, connection latency: 0.005208s, query latency: 0.001704s
[10:03:39] PROBLEM - MariaDB read only s6 on dbstore2001 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.33-MariaDB, Uptime 6813395s, 5 mysqld process(es), s6 lag: 0.17s, 12 client(s), 110.29 QPS, connection latency: 0.005661s, query latency: 0.001020s
[10:03:39] PROBLEM - MariaDB read only s2 on dbstore2001 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.33-MariaDB, Uptime 6813227s, 5 mysqld process(es), s2 lag: 0.17s, 12 client(s), 64.69 QPS, connection latency: 0.005638s, query latency: 0.000638s
[10:03:39] PROBLEM - MariaDB read only s8 on dbstore2001 is CRITICAL: CRIT: read_only: False, expected True, s5 lag is 17812588.21s: OK: Version 10.1.33-MariaDB, Uptime 6813341s, 5 mysqld process(es), s8 lag: 0.21s, 12 client(s), 134.76 QPS, connection latency: 0.005333s, query latency: 0.000933s
[10:03:55] ^ those are tests
[10:04:14] well, but they are right :-)
[10:04:56] You get what I meant ;)
[10:05:12] no, I mean the tests revealed a real issue
[10:05:33] oh, it wasn't in RO?
[10:06:30] Operations, Core-Platform-Team, HHVM, TechCom-RFC (TechCom-Approved), and 2 others: Puppetize the installation of PHP-FPM on the MediaWiki hosts - https://phabricator.wikimedia.org/T201140 (Joe) Looking at the modules tagged php on puppetforge: * `thias/php` seems well organized in terms of reso...
[10:08:40] (PS1) Ema: trafficserver: initial module/profile/role [puppet] - https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178)
[10:09:23] (CR) jerkins-bot: [V: -1] trafficserver: initial module/profile/role [puppet] - https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) (owner: Ema)
[10:10:26] (PS5) Elukey: Remove geowiki cron jobs and make puppet delete related files/dirs [puppet] - https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: Fdans)
[10:11:02] (CR) jerkins-bot: [V: -1] Remove geowiki cron jobs and make puppet delete related files/dirs [puppet] - https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: Fdans)
[10:12:41] (PS2) Ema: trafficserver: initial module/profile/role [puppet] - https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178)
[10:13:14] (PS1) Jcrespo: mariadb: Force as read only dbstore_multiinstance and santarium_mi [puppet] - https://gerrit.wikimedia.org/r/450205 (https://phabricator.wikimedia.org/T172489)
[10:13:21] (CR) Ema: [C: 2] 7.1.3+ds-4wm2: do not start the service on install or upgrade [debs/trafficserver] - https://gerrit.wikimedia.org/r/450203 (https://phabricator.wikimedia.org/T200178) (owner: Ema)
[10:13:23] (CR) jerkins-bot: [V: -1] trafficserver: initial module/profile/role [puppet] - https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) (owner: Ema)
[10:17:38] (PS3) Ema: trafficserver: initial
module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) [10:17:46] (03PS1) 10Jcrespo: mariadb: Force nagios as the user if check is not running as root [puppet] - 10https://gerrit.wikimedia.org/r/450206 (https://phabricator.wikimedia.org/T172489) [10:23:18] (03PS2) 10Jcrespo: mariadb: Force as read only dbstore_multiinstance and santarium_mi [puppet] - 10https://gerrit.wikimedia.org/r/450205 (https://phabricator.wikimedia.org/T172489) [10:23:20] (03PS2) 10Jcrespo: mariadb: Force nagios as the user if check is not running as root [puppet] - 10https://gerrit.wikimedia.org/r/450206 (https://phabricator.wikimedia.org/T172489) [10:23:49] (03PS4) 10Ema: trafficserver: initial module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) [10:24:13] (03PS3) 10Jcrespo: mariadb: Force as read only dbstore_multiinstance and sanitarium_m.i. [puppet] - 10https://gerrit.wikimedia.org/r/450205 (https://phabricator.wikimedia.org/T172489) [10:25:13] (03PS3) 10Jcrespo: mariadb: Force nagios as the user if check is not running as root [puppet] - 10https://gerrit.wikimedia.org/r/450206 (https://phabricator.wikimedia.org/T172489) [10:26:36] (03CR) 10Jcrespo: [C: 032] mariadb: Force nagios as the user if check is not running as root [puppet] - 10https://gerrit.wikimedia.org/r/450206 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [10:26:45] (03PS4) 10Jcrespo: mariadb: Force nagios as the user if check is not running as root [puppet] - 10https://gerrit.wikimedia.org/r/450206 (https://phabricator.wikimedia.org/T172489) [10:27:05] (03CR) 10Jcrespo: [C: 032] mariadb: Force as read only dbstore_multiinstance and sanitarium_m.i. 
[puppet] - 10https://gerrit.wikimedia.org/r/450205 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [10:27:27] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), and 2 others: Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10mobrovac) +1, this rate looks acceptable to me. Thank you, @EBernhardson ! @Ottomata do we have to also twe... [10:27:38] (03PS5) 10Jcrespo: mariadb: Force nagios as the user if check is not running as root [puppet] - 10https://gerrit.wikimedia.org/r/450206 (https://phabricator.wikimedia.org/T172489) [10:29:17] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10faidon) So... what's the status of this? What else has been observed, what has been done to troubleshoot and what's the latest from Juniper? I tried to access the Juniper case for m... [10:31:58] !log setting read only=1 on all instances on dbstore2001 [10:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:18] RECOVERY - MariaDB read only s2 on dbstore2001 is OK: Version 10.1.33-MariaDB, Uptime 6814945s, 5 mysqld process(es), read_only: True, s2 lag: 1.24s, 12 client(s), 301.70 QPS, connection latency: 0.004269s, query latency: 0.000763s [10:32:39] ^apparently works as intended [10:32:45] ^ marostegui [10:33:03] (03CR) 10Hoo man: [C: 031] "Looks sensible :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444632 (owner: 10Thiemo Kreuz (WMDE)) [10:33:09] Nice! 
[10:33:22] and it literally fixed a production issue [10:33:28] RECOVERY - MariaDB read only s5 on dbstore2001 is OK: Version 10.1.33-MariaDB, Uptime 6815135s, 5 mysqld process(es), read_only: True, s5 lag: 0.19s, 12 client(s), 17.67 QPS, connection latency: 0.003787s, query latency: 0.000851s [10:33:28] RECOVERY - MariaDB read only s6 on dbstore2001 is OK: Version 10.1.33-MariaDB, Uptime 6815183s, 5 mysqld process(es), read_only: True, s6 lag: 0.20s, 12 client(s), 141.13 QPS, connection latency: 0.003674s, query latency: 0.000798s [10:33:36] and also detected a grant issue on s7 [10:33:58] (03PS17) 10Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [10:34:39] RECOVERY - MariaDB read only s7 on dbstore2001 is OK: Version 10.1.33-MariaDB, Uptime 6815081s, 5 mysqld process(es), read_only: True, s7 lag: 0.00s, 13 client(s), 178.35 QPS, connection latency: 0.004027s, query latency: 0.000772s [10:35:48] wikibugs lagging? [10:36:21] no, the check happens every 5 minutes or so [10:36:51] or, sorry, ignore me, I thought you were commenting on icinga bot [10:36:57] not sure about bugs [10:38:29] RECOVERY - MariaDB read only s8 on dbstore2001 is OK: Version 10.1.33-MariaDB, Uptime 6815433s, 5 mysqld process(es), read_only: True, s8 lag: 0.52s, 13 client(s), 181.63 QPS, connection latency: 0.003514s, query latency: 0.001915s [10:38:55] should I revert, test more? 
[10:39:53] !log Running populateSitesTable.php on all Wikidata clients for T201003 [10:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:57] T201003: Run wikidata scripts on new wikis - https://phabricator.wikimedia.org/T201003 [10:47:26] (03PS1) 10Jcrespo: check_mariadb: Disable for now checks that are not read-only [puppet] - 10https://gerrit.wikimedia.org/r/450209 (https://phabricator.wikimedia.org/T172489) [10:50:19] (03CR) 10Jcrespo: [C: 032] check_mariadb: Disable for now checks that are not read-only [puppet] - 10https://gerrit.wikimedia.org/r/450209 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [10:51:50] I expect another list of errors happening now (it is a test, but the errors will be real) [10:54:21] PROBLEM - MariaDB read only s1 on dbstore2002 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.33-MariaDB, Uptime 6814786s, 5 mysqld process(es), s1 lag: 0.17s, 14 client(s), 144.71 QPS, connection latency: 0.006011s, query latency: 0.000992s [10:54:21] PROBLEM - MariaDB read only s4 on dbstore2002 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.33-MariaDB, Uptime 6814513s, 5 mysqld process(es), s4 lag: 0.17s, 18 client(s), 240.73 QPS, connection latency: 0.003792s, query latency: 0.000765s [10:54:22] PROBLEM - MariaDB read only s2 on dbstore2002 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.33-MariaDB, Uptime 6814616s, 5 mysqld process(es), s2 lag: 0.17s, 12 client(s), 106.98 QPS, connection latency: 0.006029s, query latency: 0.001143s [10:54:22] PROBLEM - MariaDB read only x1 on dbstore2002 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.33-MariaDB, Uptime 6814254s, 5 mysqld process(es), x1 lag: 0.17s, 18 client(s), 29.94 QPS, connection latency: 0.004750s, query latency: 0.000669s [10:54:22] PROBLEM - MariaDB read only s3 on dbstore2002 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.33-MariaDB, 
Uptime 6814200s, 5 mysqld process(es), s3 lag: 0.48s, 18 client(s), 1357.46 QPS, connection latency: 0.003806s, query latency: 0.001182s [10:54:38] !log setting dbstore2002 instances as read only [10:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:30] RECOVERY - MariaDB read only s3 on dbstore2002 is OK: Version 10.1.33-MariaDB, Uptime 6814261s, 5 mysqld process(es), read_only: True, s3 lag: 0.32s, 18 client(s), 152.66 QPS, connection latency: 0.007192s, query latency: 0.001467s [10:56:02] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10Rossi.dario.g) Dear all, I have been doing my homework and in particular: * signed the L3 agreement document * wikitech username = "Dario Rossi" (or "drossi") * preferred shell... [10:56:30] RECOVERY - MariaDB read only s4 on dbstore2002 is OK: Version 10.1.33-MariaDB, Uptime 6814638s, 5 mysqld process(es), read_only: True, s4 lag: 0.00s, 18 client(s), 195.79 QPS, connection latency: 0.003118s, query latency: 0.001684s [10:56:30] RECOVERY - MariaDB read only s2 on dbstore2002 is OK: Version 10.1.33-MariaDB, Uptime 6814741s, 5 mysqld process(es), read_only: True, s2 lag: 0.00s, 12 client(s), 117.74 QPS, connection latency: 0.004076s, query latency: 0.000684s [10:56:30] RECOVERY - MariaDB read only x1 on dbstore2002 is OK: Version 10.1.33-MariaDB, Uptime 6814379s, 5 mysqld process(es), read_only: True, x1 lag: 0.00s, 18 client(s), 35.64 QPS, connection latency: 0.005233s, query latency: 0.001048s [10:58:00] RECOVERY - MariaDB read only s1 on dbstore2002 is OK: Version 10.1.33-MariaDB, Uptime 6815001s, 5 mysqld process(es), read_only: True, s1 lag: 0.01s, 14 client(s), 110.17 QPS, connection latency: 0.005431s, query latency: 0.000839s [10:58:35] (03PS6) 10Elukey: Remove geowiki cron jobs and make puppet delete related files/dirs [puppet] - 10https://gerrit.wikimedia.org/r/450040 
(https://phabricator.wikimedia.org/T190059) (owner: 10Fdans) [11:09:43] (03PS1) 10Jcrespo: mariadb-sanitarium: Remove duplicate log-slave-updates on config [puppet] - 10https://gerrit.wikimedia.org/r/450211 (https://phabricator.wikimedia.org/T172489) [11:13:06] (03CR) 10Marostegui: [C: 031] mariadb-sanitarium: Remove duplicate log-slave-updates on config [puppet] - 10https://gerrit.wikimedia.org/r/450211 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [11:15:27] (03CR) 10Volans: [C: 04-1] "Some replies inline, I didn't yet checked the code yet though" (032 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez) [11:16:05] oops, didn't mean the -1, it was prefilled [11:16:24] (03CR) 10Jcrespo: [C: 032] mariadb-sanitarium: Remove duplicate log-slave-updates on config [puppet] - 10https://gerrit.wikimedia.org/r/450211 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [11:16:26] (03CR) 10Elukey: "Hey Fran, I tweaked a bit the code change to allow a more deep cleanup of things. 
I left a question for Andrew related to bacula backups, " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/450040 (https://phabricator.wikimedia.org/T190059) (owner: 10Fdans) [11:16:32] volans: gerrit knows your behaviour [11:17:03] ahahhah [11:17:18] machine learning applied to Riccardo's reviews [11:17:46] rotfl [11:18:09] (03CR) 10Alexandros Kosiaris: Allow the deploy user to get pod logs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/450201 (https://phabricator.wikimedia.org/T199489) (owner: 10Alexandros Kosiaris) [11:20:16] PROBLEM - MariaDB read only s1 on dbstore1001 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.32-MariaDB, Uptime 8221513s, 418.96 QPS, connection latency: 0.002797s, query latency: 0.000832s [11:22:58] (03PS1) 10Jcrespo: mariadb-check: Allow duplicate definitions on /etc/my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/450213 (https://phabricator.wikimedia.org/T172489) [11:23:49] (03CR) 10Jcrespo: [C: 032] mariadb-check: Allow duplicate definitions on /etc/my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/450213 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [11:27:47] !log setting dbstore1001:s1 mariadb as read_only=1 [11:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:36] RECOVERY - MariaDB read only s1 on dbstore1001 is OK: Version 10.1.32-MariaDB, Uptime 8222011s, read_only: True, 94.71 QPS, connection latency: 0.002755s, query latency: 0.000882s [11:41:16] PROBLEM - MariaDB read only s8 on db1124 is CRITICAL: Could not connect to localhost:3318 [11:41:37] we are fixing that^ [11:43:07] (03PS1) 10Alexandros Kosiaris: Helm test: Use environment variables for service-checker [deployment-charts] - 10https://gerrit.wikimedia.org/r/450215 (https://phabricator.wikimedia.org/T199489) [11:43:17] RECOVERY - MariaDB read only s8 on db1124 is OK: Version 10.1.33-MariaDB, Uptime 4600668s, read_only: True, 387.60 QPS, connection latency: 
0.002511s, query latency: 0.001131s [11:44:48] (03PS2) 10Alexandros Kosiaris: Helm test: Use environment variables for service-checker [deployment-charts] - 10https://gerrit.wikimedia.org/r/450215 (https://phabricator.wikimedia.org/T199489) [11:52:38] PROBLEM - MariaDB read only m3 on db2078 is CRITICAL: Could not connect to localhost:3323 [11:54:19] PROBLEM - MariaDB read only m5 on db2078 is CRITICAL: Could not connect to localhost:3325 [11:58:16] checking [11:58:25] I am going thru misc now [11:58:28] I will fix it [11:58:38] oh, I see [12:00:14] RECOVERY - MariaDB read only m3 on db2078 is OK: Version 10.1.33-MariaDB, Uptime 181892s, read_only: True, 22.99 QPS, connection latency: 0.002701s, query latency: 0.000831s [12:02:45] RECOVERY - MariaDB read only m5 on db2078 is OK: Version 10.1.33-MariaDB, Uptime 6145915s, read_only: True, 14.74 QPS, connection latency: 0.002820s, query latency: 0.000779s [12:05:28] !log reimage relforge to stretch - T193649 [12:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:33] T193649: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 [12:06:06] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['relforge1001.eqiad.wmnet'] ``` The log... 
[12:09:55] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 98 threshold =0.15 breach: status: red, number_of_nodes: 1, unassigned_shards: 98, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 102, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 51.0 [12:09:55] 2, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [12:10:15] PROBLEM - MariaDB read only m1 on db2078 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.33-MariaDB, Uptime 6209836s, 127.86 QPS, connection latency: 0.002825s, query latency: 0.000733s [12:10:29] ^ will fix that [12:12:05] PROBLEM - MariaDB read only m2 on db2078 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.33-MariaDB, Uptime 6204407s, 13.38 QPS, connection latency: 0.002461s, query latency: 0.000695s [12:13:05] RECOVERY - MariaDB read only m2 on db2078 is OK: Version 10.1.33-MariaDB, Uptime 6204467s, read_only: True, 12.45 QPS, connection latency: 0.002980s, query latency: 0.000697s [12:14:24] RECOVERY - MariaDB read only m1 on db2078 is OK: Version 10.1.33-MariaDB, Uptime 6210085s, read_only: True, 104.24 QPS, connection latency: 0.002736s, query latency: 0.000901s [12:16:11] ACKNOWLEDGEMENT - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 98 threshold =0.15 breach: status: red, number_of_nodes: 1, unassigned_shards: 98, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 102, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_numb [12:16:11] ards: 102, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 Gehel reimage in progress - https://phabricator.wikimedia.org/T193649 [12:30:26] 
(03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "Tested this, both locally on minikube as well as the ci namespace on the staging cluster and works fine. Merging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/450215 (https://phabricator.wikimedia.org/T199489) (owner: 10Alexandros Kosiaris) [12:30:33] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Allow the deploy user to get pod logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/450201 (https://phabricator.wikimedia.org/T199489) (owner: 10Alexandros Kosiaris) [12:33:04] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Helm test failing for CI namespace - https://phabricator.wikimedia.org/T199489 (10akosiaris) 05Open>03Resolved a:03akosiaris Aside from the RBAC rights fix, I 've also did a small change in the helm test and no... [12:33:37] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type={create_container,run_podsandbox,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:35:46] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:51:45] !log trafficserver 7.1.3+ds-4wm2 uploaded to stretch-wikimedia T200178 [12:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:49] T200178: Traffic Server packaging and initial puppetization - https://phabricator.wikimedia.org/T200178 [12:57:21] PROBLEM - Host multatuli is DOWN: PING CRITICAL - Packet loss = 100% [12:57:52] ^ that's me [12:58:00] 10Operations, 10ops-eqiad, 10Traffic: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (10BBlack) [12:58:31] RECOVERY - Host multatuli is UP: PING OK - Packet loss = 0%, RTA = 83.72 ms [12:59:03] (03CR) 10Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library (032 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez) [12:59:16] (03PS18) 10Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [12:59:19] 10Operations, 10ops-eqiad, 10Traffic: cp1085 bad DAC/SFP? - https://phabricator.wikimedia.org/T201175 (10BBlack) [12:59:54] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10BBlack) a:05Cmjohnson>03BBlack Hmmm maybe subtasks are better, setting some of those up: T201174 + T201175 [13:00:12] 10Operations, 10ops-eqiad, 10Traffic: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (10BBlack) [13:00:14] 10Operations, 10ops-eqiad, 10Traffic: cp1085 bad DAC/SFP? 
- https://phabricator.wikimedia.org/T201175 (10BBlack) [13:00:16] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10BBlack) [13:06:07] !log reboot cp3030 to SSBD-enabled microcode/kernel [13:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:00] 10Operations, 10netops: Intermitent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (10jcrespo) [13:28:33] (03PS1) 10Jcrespo: mariadb-check: Get password from key clientlabsdb and not labsdb [puppet] - 10https://gerrit.wikimedia.org/r/450220 (https://phabricator.wikimedia.org/T172489) [13:35:12] (03CR) 10Jcrespo: [C: 032] mariadb-check: Get password from key clientlabsdb and not labsdb [puppet] - 10https://gerrit.wikimedia.org/r/450220 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [13:41:17] 10Operations, 10ops-eqiad, 10Traffic: cp1085 bad DAC/SFP? - https://phabricator.wikimedia.org/T201175 (10Cmjohnson) @bblack I replaced both sfp+'s please try again and let me know if the problem persists. [13:42:39] 10Operations, 10ops-eqiad, 10Traffic: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (10Cmjohnson) Description: A problem was detected in Memory Reference Code (MRC). ------------------------------------------------------------------------------- Record: 79 Date/Time... [13:45:36] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['relforge1001.eqiad.wmnet'] ``` and were **ALL** successful. [13:48:03] 10Operations, 10ops-eqiad, 10Traffic: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (10Cmjohnson) I swapped DIMM in A5 with DIMM in B5 to see if the error follows the DIMM. 
Cleared the log [13:49:32] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp1085.eqiad.wmnet'] ``` The log can be found in `/var/log/w... [13:55:08] (03PS1) 10Jcrespo: mariadb: Enable read_only monitoring on core mariadb hosts [puppet] - 10https://gerrit.wikimedia.org/r/450228 (https://phabricator.wikimedia.org/T172489) [13:55:27] 10Operations, 10ops-eqiad, 10Traffic: cp1085 bad DAC/SFP? - https://phabricator.wikimedia.org/T201175 (10BBlack) Installer launches over PXE fine now, fixed :) [13:55:44] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Enable read_only monitoring on core mariadb hosts [puppet] - 10https://gerrit.wikimedia.org/r/450228 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [13:59:19] 10Operations, 10ops-eqiad: dbproxy1006 iDRAC IP conflict - https://phabricator.wikimedia.org/T201148 (10Cmjohnson) 05Open>03Resolved I don't know how that even happened but the drac was misconfigured with the wrong IP address. I corrected the setting on the server [13:59:30] PROBLEM - Host dbproxy1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:59:39] (03PS2) 10Jcrespo: mariadb: Enable read_only monitoring on core mariadb hosts [puppet] - 10https://gerrit.wikimedia.org/r/450228 (https://phabricator.wikimedia.org/T172489) [14:00:02] cmjohnson1: does it need a dns change or something? [14:00:09] maybe a monitoring puppet run? [14:00:14] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Enable read_only monitoring on core mariadb hosts [puppet] - 10https://gerrit.wikimedia.org/r/450228 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [14:01:06] jynus I need to look at the idrac settings for dbproxy1006... 
can I take it down for 5 mins [14:01:20] oh, I was just trying to help [14:01:36] I thought you had done everything due to the resolution [14:01:37] sorry [14:01:58] I thought I did but to see dbproxy mgmt go down now....says something isn't right on that server [14:01:59] (03PS3) 10Jcrespo: mariadb: Enable read_only monitoring on core mariadb hosts [puppet] - 10https://gerrit.wikimedia.org/r/450228 (https://phabricator.wikimedia.org/T172489) [14:11:15] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.15/includes/Storage/DerivedPageDataUpdater.php: Fix article counting logic in DerivedPageDataUpdater (duration: 00m 50s) [14:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:30] !log stopping haproxy on dbproxy1006 [14:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:32] cmjohnson1: dbproxy1006 is completely inactive right now, and downtimed on icinga, including mgmt [14:14:33] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:15:26] thx jynus [14:20:01] (03CR) 10Jcrespo: "This conflicts with https://gerrit.wikimedia.org/r/449742 , depending which goes first, it will need changes to not use ::mw_primary" [puppet] - 10https://gerrit.wikimedia.org/r/450228 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [14:20:43] RECOVERY - Host dbproxy1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.25 ms [14:21:09] 10Operations, 10Wikimedia-Logstash, 10monitoring, 10Patch-For-Review, 10User-herron: Send logstash service metrics to prometheus - https://phabricator.wikimedia.org/T200362 (10fgiunchedi) We'll need to add `jmx_exporter` to Logstash too, to get JVM stats like most other JVMs on the fleet. [14:25:54] apergos: any idea why cleanup_old_xmldumps.py might have failed on labstore1006 off the top of your head? [14:26:15] Before I dive in and start poking [14:27:35] is that for a new wiki? 
maybe those files don't exist yet [14:27:42] they will and then it will shut up [14:27:56] Ahah! That might be it [14:28:10] and maybe the files had already made it over to labstore1007 before the cleanup there [14:28:12] all guesses [14:28:27] but since I didn't see whines from both hosts it went right into my ignore filter [14:29:24] added june 25 to apache conf so yeah it's new [14:30:00] Fair enough. The files are there right now, but they may not have been on that script run [14:30:23] Thanks for the insight :) [14:30:27] sure [14:38:33] 10Operations, 10ops-eqiad, 10netops: asw2-a-eqiad VC link down - https://phabricator.wikimedia.org/T201095 (10Cmjohnson) If we need to swap this cable, we will need to order more 5M 40G QFSP+ cables. I only have 3M spares. I don't know if we want to use the Fiberstore brand....we had a few bad cables duri... [14:44:05] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install auth1002 - https://phabricator.wikimedia.org/T196698 (10Cmjohnson) a:05Cmjohnson>03RobH @robh replaced the cable....i see a link now. Should be good to go [14:45:22] 10Operations, 10ops-eqiad: mw1239 correctable memory errors - https://phabricator.wikimedia.org/T198398 (10Cmjohnson) @herron is it okay to take this server down? [14:45:33] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:46:43] 10Operations, 10ops-eqiad, 10DC-Ops: Replace memory bank on scb1002 - https://phabricator.wikimedia.org/T196901 (10Cmjohnson) @joe please let me know when it's okay to take this down? We can schedule for Tuesday 7 August. [14:48:08] 10Operations, 10ops-eqiad, 10Traffic: cp1068 memory correctable errors - https://phabricator.wikimedia.org/T194757 (10Cmjohnson) The server will need to be powered down to reseat DIMM...please schedule a day/time with me. 
[14:49:01] 10Operations, 10ops-eqiad: kafka1023 correctable memory errors - https://phabricator.wikimedia.org/T194249 (10Cmjohnson) The server will need to be powered off please let me know a good day/time to do this. [14:56:48] 10Operations, 10ops-eqiad: mw1239 correctable memory errors - https://phabricator.wikimedia.org/T198398 (10herron) >>! In T198398#4476752, @Cmjohnson wrote: > @herron is it okay to take this server down? Hey @Cmjohnson, yes mw1239 is depooled and I've just set 2 hours downtime for the host and it's mgmt inter... [14:57:01] 10Operations, 10CommRel-Internals, 10Wikimedia-Mailing-lists: Close https://lists.wikimedia.org/mailman/listinfo/cep and keep the archive for now - https://phabricator.wikimedia.org/T155683 (10Qgil) a:05Qgil>03None The title and description of this task are correct, and they seem to [provide everything #... [15:00:44] 10Operations, 10ops-eqiad, 10User-herron: mw1239 correctable memory errors - https://phabricator.wikimedia.org/T198398 (10herron) [15:04:19] 10Operations, 10ops-eqiad, 10netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171 (10Cmjohnson) [15:05:10] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp1085.eqiad.wmnet'] ``` and were **ALL** successful. [15:05:42] 10Operations, 10ops-eqiad, 10Traffic: cp1068 memory correctable errors - https://phabricator.wikimedia.org/T194757 (10BBlack) 05Open>03declined Let's just skip this, it's one of the servers we'll be decomming once cp1075-90 are rolled into service. [15:06:02] 10Operations, 10ops-eqiad, 10netops: replace mr1-eqiad - https://phabricator.wikimedia.org/T185171 (10Cmjohnson) 05Open>03Resolved @faidon I did not change racktables until it was removed from the rack. The old srx is now removed and added to the decom tracking sheet. I updated racktables to reflect the... 
[15:10:43] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10BBlack) [15:10:45] 10Operations, 10ops-eqiad, 10Traffic: cp1085 bad DAC/SFP? - https://phabricator.wikimedia.org/T201175 (10BBlack) 05Open>03Resolved [15:11:04] 10Operations, 10ops-eqiad, 10User-herron: mw1239 correctable memory errors - https://phabricator.wikimedia.org/T198398 (10Cmjohnson) Thx @herron, I swapped the DIMM from A side to B side. Let's see if that reseating corrects the error. [15:12:06] 10Operations, 10ops-eqiad, 10Traffic: cp1050 apparently stuck while "Initializing firmware interfaces..." - https://phabricator.wikimedia.org/T171168 (10BBlack) 05Open>03declined To be decommed in the next couple of weeks, no point [15:12:28] 10Operations, 10ops-eqiad, 10Traffic: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252 (10BBlack) 05Open>03declined To be decommed in the next couple of weeks, no point! [15:13:44] (03PS1) 10Ladsgroup: Use the correct destination for jobs.wikimedia.org and similar [puppet] - 10https://gerrit.wikimedia.org/r/450232 [15:17:10] (03PS1) 10Herron: install_server: add archiva1001 to dhcp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/450233 (https://phabricator.wikimedia.org/T200895) [15:18:43] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1051 - https://phabricator.wikimedia.org/T195484 (10Cmjohnson) [15:19:44] (03CR) 10Herron: [C: 032] install_server: add archiva1001 to dhcp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/450233 (https://phabricator.wikimedia.org/T200895) (owner: 10Herron) [15:21:44] (03CR) 10Dzahn: [C: 032] "Alex, sorry, i somehow read "already uploaded" like "already cherry-picked"" [puppet] - 10https://gerrit.wikimedia.org/r/450078 (https://phabricator.wikimedia.org/T192561) (owner: 10Thcipriani) [15:23:24] (03PS1) 10Dzahn: postgresql::backup: remove all quotes to still fix cron spam [puppet] - 
10https://gerrit.wikimedia.org/r/450236 [15:23:47] (03CR) 10jerkins-bot: [V: 04-1] postgresql::backup: remove all quotes to still fix cron spam [puppet] - 10https://gerrit.wikimedia.org/r/450236 (owner: 10Dzahn) [15:26:49] (03PS2) 10Dzahn: postgresql::backup: remove all quotes to still fix cron spam [puppet] - 10https://gerrit.wikimedia.org/r/450236 [15:27:26] (03CR) 10jerkins-bot: [V: 04-1] postgresql::backup: remove all quotes to still fix cron spam [puppet] - 10https://gerrit.wikimedia.org/r/450236 (owner: 10Dzahn) [15:30:25] (03PS19) 10Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [15:38:08] 10Operations, 10Wikimedia-Logstash, 10monitoring, 10Patch-For-Review, 10User-herron: Send logstash service metrics to prometheus - https://phabricator.wikimedia.org/T200362 (10fgiunchedi) >>! In T200362#4464250, @fgiunchedi wrote: > I took a look at both metrics and it seems https://github.com/BonnierNew... [15:40:59] (03PS20) 10Vgutierrez: WIP: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [15:45:16] !log herron@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1239.eqiad.wmnet [15:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:36] (03PS21) 10Vgutierrez: provide ACMEv2 support based on certbot/acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) [15:48:56] \o/ [15:50:04] 10Operations, 10ops-eqiad, 10User-herron: mw1239 correctable memory errors - https://phabricator.wikimedia.org/T198398 (10herron) 05Open>03Resolved a:03herron Great! Host has been repooled and we'll see if this reoccurs. Thanks @Cmjohnson! 
[15:51:35] (03PS5) 10Ema: trafficserver: initial module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178) [15:55:28] (03CR) 10Dzahn: "single quotes = fails in cron, double quotes = fails in cron with other error, no quotes and escape % = Unrecognized escape sequence '\%' " [puppet] - 10https://gerrit.wikimedia.org/r/450236 (owner: 10Dzahn) [15:56:20] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Get helm test to dump more information - https://phabricator.wikimedia.org/T200348 (10thcipriani) 05Open>03Invalid This task is now unnecessary since @akosiaris updated the RBAC was updated in T199489 (\o/) [15:56:22] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Helm test failing for CI namespace - https://phabricator.wikimedia.org/T199489 (10thcipriani) [15:59:12] (03PS1) 10Herron: assign archiva1001 spare::system role [puppet] - 10https://gerrit.wikimedia.org/r/450241 (https://phabricator.wikimedia.org/T200895) [15:59:19] 10Operations, 10SRE-Access-Requests: Jmorgan production ssh revokation/replacement (due to key in use in production and cloud) - https://phabricator.wikimedia.org/T201185 (10RobH) p:05Triage>03High [16:00:36] (03PS1) 10RobH: jmorgan ssh key revocation [puppet] - 10https://gerrit.wikimedia.org/r/450242 (https://phabricator.wikimedia.org/T201185) [16:01:22] (03CR) 10Herron: [C: 032] assign archiva1001 spare::system role [puppet] - 10https://gerrit.wikimedia.org/r/450241 (https://phabricator.wikimedia.org/T200895) (owner: 10Herron) [16:01:46] (03CR) 10RobH: [C: 032] jmorgan ssh key revocation [puppet] - 10https://gerrit.wikimedia.org/r/450242 (https://phabricator.wikimedia.org/T201185) (owner: 10RobH) [16:01:54] (03PS2) 10RobH: jmorgan ssh key revocation [puppet] - 10https://gerrit.wikimedia.org/r/450242 (https://phabricator.wikimedia.org/T201185) [16:02:02] 10Operations, 10ops-eqiad, 10netops: asw2-a-eqiad VC link down - 
https://phabricator.wikimedia.org/T201095 (10ayounsi) We need to always have spares of the cables we use in production. Are spares being tracked somewhere? As this one is urgent to replace and Fiberstore is quick, I'd say go with them as well. [16:02:08] herron: are we rebase racing? =] [16:03:12] yeah man, for pink slips [16:03:32] (03CR) 10Dzahn: [C: 031] "yes please. it was reported by the account checker script and i have sent an email to the user" [puppet] - 10https://gerrit.wikimedia.org/r/450242 (https://phabricator.wikimedia.org/T201185) (owner: 10RobH) [16:07:18] 10Operations, 10SRE-Access-Requests: Jmorgan production ssh revokation/replacement (due to key in use in production and cloud) - https://phabricator.wikimedia.org/T201185 (10RobH) [16:08:52] heh [16:09:12] * robh imagines herron and himself as greasers racing in the la river basin [16:09:33] very 1960s [16:09:44] :D [16:13:17] 10Operations, 10SRE-Access-Requests: Jmorgan production ssh revokation/replacement (due to key in use in production and cloud) - https://phabricator.wikimedia.org/T201185 (10RobH) Just to clarify, the fix for this is easy: * generate a new private/public ssh keypair for WMF production access ** do not use thi... 
[16:14:58] 10Operations, 10Thumbor: Thumbnails don't seem to be being created/saved for id_internalwikimedia - https://phabricator.wikimedia.org/T201187 (10Reedy) [16:19:16] 10Operations, 10Thumbor: Thumbnails don't seem to be being created/saved for id_internalwikimedia - https://phabricator.wikimedia.org/T201187 (10Reedy) [16:21:31] (03PS1) 10Bstorm: gridengine: remove puppetized hosts file [puppet] - 10https://gerrit.wikimedia.org/r/450247 (https://phabricator.wikimedia.org/T139190) [16:23:01] (03PS1) 10RobH: new auth1002 server install params [puppet] - 10https://gerrit.wikimedia.org/r/450248 (https://phabricator.wikimedia.org/T196698) [16:23:29] (03CR) 10RobH: [C: 032] new auth1002 server install params [puppet] - 10https://gerrit.wikimedia.org/r/450248 (https://phabricator.wikimedia.org/T196698) (owner: 10RobH) [16:28:21] 10Operations, 10netops: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) a:03ayounsi Juniper ticket 2018-0803-0360 created. According to Kibana, this started ~6h after T201095 got created. As it outputs Critical and Emergency logs, another questio... 
[16:32:17] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:34:17] 04Critical Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - = "emerg" [16:34:19] 04Critical- Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Emergency syslog message got better [16:35:03] 04Critical- Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Emergency syslog message got better [16:36:50] o_0 [16:37:26] 10Operations, 10ops-eqiad: rack/setup/install auth1002 - https://phabricator.wikimedia.org/T196698 (10RobH) [16:37:27] ^ XioNoX [16:37:37] 10Operations, 10ops-eqiad: rack/setup/install auth1002 - https://phabricator.wikimedia.org/T196698 (10RobH) [16:37:58] yep, I created new alerts for https://phabricator.wikimedia.org/T201145#4477030 [16:38:06] we used to only catch criticals [16:38:13] ahh [16:38:20] 04Critical Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Critical syslog messages [16:38:22] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message [16:38:27] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:38:42] what action to take when those fire? [16:38:54] panic! [16:39:28] my understanding is if its just one link out of half a dozen at a site its no big deal. 
if we suddenly lose all the links to something like ulsfo, eqsin, or esams, we can depool them [16:39:41] if we lose all the links to eqiad, revert to panic ;D [16:41:22] i realize that was mostly directed at XioNoX, just sharing what i know =] [16:43:01] guessing also to log into librenms and find more details about the syslog errors [16:43:17] I'll add them to https://wikitech.wikimedia.org/wiki/Network_monitoring when I have some time [16:43:46] but emergency is usually pretty bad, here for example a switch is restarting every 5min, so it's worth paging people [16:46:46] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 28, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 3, active_shards_percent_as_number: 100.0, active_shards: 49, initializ [16:46:47] er_of_data_nodes: 2, delayed_unassigned_shards: 0 [16:48:38] ACKNOWLEDGEMENT - Elasticsearch HTTPS on relforge1001 is CRITICAL: SSL CRITICAL - failed to verify relforge.svc.eqiad.wmnet against relforge1001.eqiad.wmnet Gehel SSL cert needs to be regenerated after reimage - https://phabricator.wikimedia.org/T193649 [17:01:51] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) CCed you to the JTAC case, not sure how to make sure you have default access to all the cases. So far poor replies from JTAC, I'll escalate if it doesn't get proper respons... 
[17:03:56] (03PS7) 10Bstorm: Allow PuppetDB use on standalone puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/435631 (https://phabricator.wikimedia.org/T153577) (owner: 10Alex Monk) [17:10:17] 10Operations, 10User-ArielGlenn, 10User-Joe: Puppetize the installation of PHP-FPM on the MediaWiki hosts - https://phabricator.wikimedia.org/T201140 (10Legoktm) [17:12:30] (03PS1) 10RobH: fixing typo in reverse dns for auth1002 [dns] - 10https://gerrit.wikimedia.org/r/450249 (https://phabricator.wikimedia.org/T196698) [17:12:34] (03CR) 10RobH: [C: 032] fixing typo in reverse dns for auth1002 [dns] - 10https://gerrit.wikimedia.org/r/450249 (https://phabricator.wikimedia.org/T196698) (owner: 10RobH) [17:17:50] (03PS1) 10Herron: add forward/reverse DNS for archiva1001 IPv6 address [dns] - 10https://gerrit.wikimedia.org/r/450250 (https://phabricator.wikimedia.org/T200895) [17:19:07] (03CR) 10Bstorm: [C: 032] Allow PuppetDB use on standalone puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/435631 (https://phabricator.wikimedia.org/T153577) (owner: 10Alex Monk) [17:21:06] 10Operations, 10netops: cr1/2-eqiad PFE_FW_SYSLOG_IP6_GEN log entries - https://phabricator.wikimedia.org/T201149 (10ayounsi) a:03ayounsi `/kernel: Nexthop index allocation failed` due to router limitation and the design of our mgmt network (see description of T174397) `PFE_FW_SYSLOG_IP6_GEN` is temporary l... 
[17:21:33] (03PS2) 10Herron: add forward/reverse DNS for archiva1001 IPv6 address [dns] - 10https://gerrit.wikimedia.org/r/450250 (https://phabricator.wikimedia.org/T200895) [17:22:54] (03CR) 10Herron: [C: 032] add forward/reverse DNS for archiva1001 IPv6 address [dns] - 10https://gerrit.wikimedia.org/r/450250 (https://phabricator.wikimedia.org/T200895) (owner: 10Herron) [17:29:37] (03PS7) 10Dzahn: jenkins: add workspacesDir system property [puppet] - 10https://gerrit.wikimedia.org/r/449769 (https://phabricator.wikimedia.org/T200953) (owner: 10Dduvall) [17:30:06] 10Operations, 10vm-requests, 10Patch-For-Review, 10User-herron: eqiad: (1) VM request for Archiva - https://phabricator.wikimedia.org/T200895 (10herron) 05Open>03Resolved Alrighty, `archiva1001.wikimedia.org` has been provisioned and assigned role `spare::system`. Please re-open if you need anything e... [17:36:35] 10Operations, 10Patch-For-Review: requesting additional production ssh key for jmorgan - https://phabricator.wikimedia.org/T200103 (10Aklapper) Related: T201185 [17:37:29] 10Operations, 10ops-eqiad: rack/setup/install auth1002 - https://phabricator.wikimedia.org/T196698 (10RobH) a:05RobH>03None [17:39:16] 10Operations: rack/setup/install auth1002 - https://phabricator.wikimedia.org/T196698 (10RobH) a:03MoritzMuehlenhoff @MoritzMuehlenhoff: It is my understanding that you are the primary person that handles the authentication servers. (If not, please correct me!) auth1002.eqiad.wmnet is online and ready for y... 
[17:42:32] (03CR) 10Thcipriani: [C: 031] "Puppet compiler: https://puppet-compiler.wmflabs.org/compiler02/11982/" [puppet] - 10https://gerrit.wikimedia.org/r/449769 (https://phabricator.wikimedia.org/T200953) (owner: 10Dduvall) [17:47:10] mutante: just fyi, we're ready to issue a restart if you think ^ is good to merge [17:47:41] ready = standing by [17:47:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10RobH) a:05RobH>03Cmjohnson I see we have a bunch of spare: Intel 320 Series SSDSA2CW300G3 2.5" 300GB Can we try putting in an SSD and rebuilding, rather than buying an e... [17:49:24] marxarelli: ok! i got distracted. let's get it done :) [17:49:30] sounds good [17:49:45] thanks! [17:49:56] (03CR) 10Dzahn: [C: 032] jenkins: add workspacesDir system property [puppet] - 10https://gerrit.wikimedia.org/r/449769 (https://phabricator.wikimedia.org/T200953) (owner: 10Dduvall) [17:50:54] (03PS2) 10Bstorm: wiki replicas: moving compatibility views to $table_compat [puppet] - 10https://gerrit.wikimedia.org/r/447654 (https://phabricator.wikimedia.org/T174047) [17:51:29] marxarelli: the unit file has been updated on both contint servers [17:51:37] should i just restart it too? already on them [17:51:51] mutante: awesome. thanks! i think thcipriani and i are going to handle the restart [17:52:03] ok, cool, stepping back :) [17:52:05] PROBLEM - Check systemd state on auth1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:52:10] so we can keep an eye on zuul, etc. at the same time [17:52:20] oh, maybe i can sneak in a second change? [17:52:29] for zuul , heh [17:52:37] mutante: could you run puppet on releases1001 and releases2001 as well? [17:52:52] sure [17:53:03] thank you! [17:53:13] what's the second change? 
[17:53:42] to make zuul use systemd::service like we just did for jenkins [17:53:42] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/434427/ [17:53:48] for jenkins it was noop [17:54:26] RECOVERY - Check systemd state on auth1002 is OK: OK - running: The system is fully operational [17:59:05] thcipriani: puppet runs are done since a while, sry [18:01:33] mutante: on that patch I don't know about adding restart => true to the zuul server. I think we may want to manage that manually since it'd drop all the patches that are currently queued since it's tied to gearman. [18:03:03] thcipriani: that's not supposed to be a change in behaviour. the thing is that before it was an implicit default [18:03:11] to have refresh = true, [18:03:23] 10Operations, 10netops: Intermitent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (10ayounsi) a:03ayounsi [18:03:44] and with the new class you would have to mention it [18:04:34] i'll wait [18:05:03] !log restarting releases-jenkins [18:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:14] !log restarting jenkins on contint1001 [18:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:19] (03PS1) 10Bstorm: osmdb: Change labsdb1007 back to slave and reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/450251 (https://phabricator.wikimedia.org/T197246) [18:08:43] PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3031_v4, cp3031_v6 [18:09:51] What's up with the beta cluster? https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page [18:10:18] de is working actually. [18:11:38] Niharika: I don't see what you're seeing? 
[18:12:35] (03CR) 10Bstorm: [C: 032] osmdb: Change labsdb1007 back to slave and reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/450251 (https://phabricator.wikimedia.org/T197246) (owner: 10Bstorm) [18:13:22] (03PS1) 10RobH: setting graphite1004 install params [puppet] - 10https://gerrit.wikimedia.org/r/450252 (https://phabricator.wikimedia.org/T196484) [18:13:53] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10mobrovac) [18:14:07] (03CR) 10jerkins-bot: [V: 04-1] setting graphite1004 install params [puppet] - 10https://gerrit.wikimedia.org/r/450252 (https://phabricator.wikimedia.org/T196484) (owner: 10RobH) [18:14:23] (03PS2) 10RobH: setting graphite1004 install params [puppet] - 10https://gerrit.wikimedia.org/r/450252 (https://phabricator.wikimedia.org/T196484) [18:15:39] (03CR) 10RobH: [C: 032] setting graphite1004 install params [puppet] - 10https://gerrit.wikimedia.org/r/450252 (https://phabricator.wikimedia.org/T196484) (owner: 10RobH) [18:15:47] (03PS3) 10RobH: setting graphite1004 install params [puppet] - 10https://gerrit.wikimedia.org/r/450252 (https://phabricator.wikimedia.org/T196484) [18:17:49] mutante: sorry, I was doing restart/followup stuff. I didn't realize the refresh was implicit before! I'll review your change now. [18:18:35] greg-g: That wiki is down for me. "Our servers are currently under maintenance or experiencing a technical problem. 
Please try again in a few minutes" [18:22:24] I am seeing the same message on beta now [18:22:38] logs showing: [18:22:46] > Call to undefined method WikitextContent::getEntity() in /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/Store/Sql/WikiPageEntityRevisionLookup.php on line 189 [18:22:51] still loads fine for me, but I believe you two :) [18:22:57] add ?debug=true [18:24:39] got it trying to login [18:27:02] thcipriani: Does https://en.wikipedia.beta.wmflabs.org/wiki/Special:Preferences load for you? [18:27:16] It's apparently only the front page that gives the error. [18:27:35] it does load [18:27:49] 10Operations, 10monitoring: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 (10RobH) a:05RobH>03fgiunchedi Please note I set this to role spare, since I wasn't sure if setting it to any other role may produce logging spam/traffic/alerts to the other graphite hosts. When in doub... [18:28:00] 10Operations, 10monitoring: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 (10RobH) [18:28:53] Does anyone want me to file a task for that? [18:28:57] filed as: https://phabricator.wikimedia.org/T201194 [18:29:36] Thanks thcipriani! [18:30:07] thanks for noticing it [18:30:37] No problem. I was trying to test a feature and realized it's down. 
[18:34:43] (03CR) 10BryanDavis: [C: 031] gridengine: remove puppetized hosts file [puppet] - 10https://gerrit.wikimedia.org/r/450247 (https://phabricator.wikimedia.org/T139190) (owner: 10Bstorm) [18:37:05] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Dario Rossi (username drossi) - https://phabricator.wikimedia.org/T201196 (10herron) p:05Triage>03Normal [18:41:01] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Diego da Hora - https://phabricator.wikimedia.org/T201197 (10herron) p:05Triage>03Normal [18:41:04] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Marc Jeanmougin - https://phabricator.wikimedia.org/T201198 (10herron) p:05Triage>03Normal [18:41:06] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Flavia Salutari - https://phabricator.wikimedia.org/T201199 (10herron) p:05Triage>03Normal [18:42:35] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10herron) [18:44:35] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10herron) Thanks @Rossi.dario.g! I've created T201196 to track your individual access request, and moved your access request checklist into this task. I've done the same for the oth... 
[18:46:35] (03PS3) 10Dzahn: postgresql::backup: remove all quotes to still fix cron spam [puppet] - 10https://gerrit.wikimedia.org/r/450236 [18:52:29] (03PS4) 10Dzahn: postgresql::backup: remove all quotes to still fix cron spam [puppet] - 10https://gerrit.wikimedia.org/r/450236 [19:00:56] (03CR) 10Dzahn: [C: 032] postgresql::backup: remove all quotes to still fix cron spam [puppet] - 10https://gerrit.wikimedia.org/r/450236 (owner: 10Dzahn) [19:01:09] (03PS5) 10Dzahn: postgresql::backup: remove all quotes to still fix cron spam [puppet] - 10https://gerrit.wikimedia.org/r/450236 [19:04:54] (03PS1) 10RobH: centrallog1001 install parameters [puppet] - 10https://gerrit.wikimedia.org/r/450256 (https://phabricator.wikimedia.org/T195416) [19:05:09] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Celery manager implodes horribly if Redis goes down - https://phabricator.wikimedia.org/T181632 (10Harej) p:05Normal>03High [19:05:38] (03PS1) 10Dzahn: postgresql::backup: don't run both crons at same minute [puppet] - 10https://gerrit.wikimedia.org/r/450257 [19:06:01] (03PS2) 10RobH: centrallog1001 install parameters [puppet] - 10https://gerrit.wikimedia.org/r/450256 (https://phabricator.wikimedia.org/T195416) [19:06:47] (03CR) 10RobH: [C: 032] centrallog1001 install parameters [puppet] - 10https://gerrit.wikimedia.org/r/450256 (https://phabricator.wikimedia.org/T195416) (owner: 10RobH) [19:07:22] (03PS2) 10Dzahn: postgresql::backup: don't run both crons at same minute [puppet] - 10https://gerrit.wikimedia.org/r/450257 (https://phabricator.wikimedia.org/T190184) [19:10:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1003 raid warning - https://phabricator.wikimedia.org/T200203 (10Andrew) @robh, worth a try! 
[19:22:27] (03PS3) 10Dzahn: postgresql::backup: don't run both crons at same minute (debug) [puppet] - 10https://gerrit.wikimedia.org/r/450257 (https://phabricator.wikimedia.org/T190184) [19:23:04] (03CR) 10jerkins-bot: [V: 04-1] postgresql::backup: don't run both crons at same minute (debug) [puppet] - 10https://gerrit.wikimedia.org/r/450257 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [19:26:52] (03PS4) 10Dzahn: postgresql::backup: don't run both crons at same minute (debug) [puppet] - 10https://gerrit.wikimedia.org/r/450257 (https://phabricator.wikimedia.org/T190184) [19:29:54] (03CR) 10Thcipriani: "One inline comment about me being nervous" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434427 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [19:40:05] (03PS1) 10Dzahn: postgresql::dump: unify commands in a single cron job [puppet] - 10https://gerrit.wikimedia.org/r/450261 (https://phabricator.wikimedia.org/T190184) [19:41:04] (03CR) 10jerkins-bot: [V: 04-1] postgresql::dump: unify commands in a single cron job [puppet] - 10https://gerrit.wikimedia.org/r/450261 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [19:51:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10Andrew) [19:58:27] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Celery manager implodes horribly if Redis goes down - https://phabricator.wikimedia.org/T181632 (10awight) @Harej Thanks for the bump, this scares me too. [20:06:51] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) JTAC came back with troubleshooting and data gathering commands/configuration to do if the issue happen again. 
[20:12:00] 10Operations, 10ops-eqiad: rack/setup/install centrallog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (10RobH) [20:16:23] 10Operations, 10netops: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) JTAC recommendation is to format and re-install the switch member using: https://kb.juniper.net/InfoCenter/index?page=content&id=KB20643 In their emails they say that only usb i... [20:41:12] (03PS1) 10Herron: admin: add bmueller to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/450280 (https://phabricator.wikimedia.org/T199965) [20:43:35] (03CR) 10Herron: [C: 032] admin: add bmueller to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/450280 (https://phabricator.wikimedia.org/T199965) (owner: 10Herron) [20:50:51] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:55:56] 10Operations, 10LDAP-Access-Requests: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group - https://phabricator.wikimedia.org/T199967 (10herron) [20:56:01] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:56:02] 10Operations, 10LDAP-Access-Requests, 10Graphite, 10Patch-For-Review, 10User-Addshore: Give Bmueller grafana-admin access - https://phabricator.wikimedia.org/T199965 (10herron) 05Open>03Resolved a:03herron Bmueller has been added to ldap groups `cn=wmde,ou=groups,dc=wikimedia,dc=org` and `cn=nda,ou... [20:58:43] 10Operations, 10LDAP-Access-Requests: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group - https://phabricator.wikimedia.org/T199967 (10herron) Bmueller has been added to ldap groups nda and wmde (see T199965), but I am not seeing the ldap account for Lea. What is their ldap username? 
[21:07:52] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Dario Rossi (username drossi) - https://phabricator.wikimedia.org/T201196 (10herron) a:05Gilles>03Rossi.dario.g Hi @Rossi.dario.g, are you sure that your wikitech username is drossi? I am not finding a result at https://wikitech.w... [21:09:20] 10Operations, 10Analytics, 10Documentation: Remove data from Hadoop's HDFS as part of the user offboard workflow - https://phabricator.wikimedia.org/T200312 (10herron) p:05Triage>03Normal [21:10:20] 10Operations, 10Thumbor: Thumbnails don't seem to be being created/saved for id_internalwikimedia - https://phabricator.wikimedia.org/T201187 (10herron) p:05Triage>03High [21:11:21] 10Operations, 10DBA, 10monitoring: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (10herron) p:05Triage>03Normal [21:13:04] 10Operations, 10Graphite: Include ADD operation in memcached stats and grafana dashboard - https://phabricator.wikimedia.org/T201016 (10herron) p:05Triage>03Normal [21:14:56] 10Operations, 10Packaging, 10Toolforge: Please add php-imagick and php-redis packages to apt.wikimedia.org thirdparty/php72 - https://phabricator.wikimedia.org/T200666 (10herron) p:05Triage>03Normal [21:17:21] 10Operations, 10Packaging, 10Toolforge: Upload python-pykube deb to apt.wikimedia.org - https://phabricator.wikimedia.org/T200660 (10herron) p:05Triage>03Normal [21:54:14] 10Operations: rack/setup/install centrallog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (10RobH) a:05RobH>03fgiunchedi [21:55:12] 10Operations: rack/setup/install centrallog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (10RobH) @fgiunchedi: You were the SRE team member to provide feedback regarding the disk capacity, so I'm assuming you would be the service owner. If this isn't correct, please comment/assign back to me/assi... 
[22:03:43] PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:08:10] (03PS1) 10RobH: setting dns100[12] install params [puppet] - 10https://gerrit.wikimedia.org/r/450286 (https://phabricator.wikimedia.org/T196691) [22:08:52] (03CR) 10RobH: [C: 032] setting dns100[12] install params [puppet] - 10https://gerrit.wikimedia.org/r/450286 (https://phabricator.wikimedia.org/T196691) (owner: 10RobH) [22:16:35] 10Operations, 10ops-eqiad, 10DNS, 10Traffic: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (10RobH) [22:16:42] RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational [22:38:52] (03CR) 10Bstorm: "I almost merged this now, but I think I'll do it first thing on Monday morning instead." [puppet] - 10https://gerrit.wikimedia.org/r/450247 (https://phabricator.wikimedia.org/T139190) (owner: 10Bstorm) [22:39:25] 10Operations, 10ops-eqiad, 10DNS, 10Traffic: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (10RobH) a:05RobH>03BBlack So these two systems fail their puppet runs, but fail for the following: Error: Could not retrieve catalog from remote server: Error 500 on SE... 
[22:39:39] 10Operations, 10DNS, 10Traffic: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (10RobH) [22:55:29] (03PS7) 10EBernhardson: Add mjolnir kafka daemon to primary elasticsearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) [23:07:28] !log - restart asw2-b5-eqiad into loader - T201145 [23:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:33] T201145: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 [23:42:06] PROBLEM - TFTP service on install1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* [23:43:25] RECOVERY - TFTP service on install1002 is OK: PROCS OK: 1 process with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .*