[00:06:03] (CR) Dzahn: [C: +2] aphlict: add envoy for TLS termination [puppet] - https://gerrit.wikimedia.org/r/616184 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[00:32:21] (PS2) Dave Pifke: [WIP] arclamp: run arclamp-compress-logs [puppet] - https://gerrit.wikimedia.org/r/616179 (https://phabricator.wikimedia.org/T235456)
[00:35:55] Operations, Arc-Lamp, Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (Dzahn) Dave Pifke zipped some files and is working on a patch to make xhgui support gzipped files.
[00:42:39] !log Manually compressing some more data on webperf1002, using arclamp-compress-logs from https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/615904.
[00:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:34] dpifke: i see the disks i tried to create have been created so there is one 100GB and one 20GB disk assigned to webperf1002 and not mounted. but those would not be useful so i remove them again?
[00:43:56] Yeah, at this point I think we're almost good to go with gzip.
[00:44:13] ok, that's great, thx
[00:46:30] !log ganeti - removing disk 3 (20G) from webperf1002. the disks are 0-indexed, so the ones actually mounted are 0 (50G) and 1 (300G) (T257931)
[00:46:41] looks like removing one takes just as long as creating it ...
[00:46:52] Fun. :)
[00:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:42] T257931: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931
[00:50:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:40] (CR) Dave Pifke: "Puppet compiler output: https://puppet-compiler.wmflabs.org/compiler1001/24135/" [puppet] - https://gerrit.wikimedia.org/r/613359 (https://phabricator.wikimedia.org/T256035) (owner: Dave Pifke)
[01:08:36] RECOVERY - Disk space on webperf1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops
[01:52:47] !log ganeti - also removing (unmounted) disk 2 (100G) from webperf1002. T257931
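The webperf1002 cleanup above involves two kinds of commands: compressing old Arc Lamp logs by hand, and dropping the unused Ganeti disks again. A minimal sketch of both follows, assuming standard find/gzip and gnt-instance usage; the log path is illustrative and not taken from the log, and the exact flags actually run may have differed.

    # compress Arc Lamp logs older than 30 days to free space under /srv
    # (roughly what the arclamp-compress-logs script in the linked change automates;
    # /srv/xenon/logs is an assumed path for illustration)
    find /srv/xenon/logs -name '*.log' -mtime +30 -print -exec gzip {} \;

    # drop the unneeded Ganeti disks again; disks are 0-indexed, so 3 is the 20G
    # one and 2 the 100G one (run on the Ganeti master for the cluster)
    sudo gnt-instance modify --disk 3:remove webperf1002.eqiad.wmnet
    sudo gnt-instance modify --disk 2:remove webperf1002.eqiad.wmnet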
T257931 [01:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:54] T257931: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 [03:39:22] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [03:41:16] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [04:52:58] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [04:54:52] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [06:11:26] (03CR) 10Alexandros Kosiaris: api-gateway: Basic envoy chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [06:39:54] 10Operations, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Chtnnh) @Aklapper anyone I can reach out to for this task? [06:48:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:54:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200725T0700) [07:07:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:09:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
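The recurring "Prometheus jobs reduced availability" alerts above fire when some targets of a scrape job stop reporting. One quick way to see which targets are affected is to ask Prometheus for the job's up metric; a sketch only, using the standard HTTP query API, with host, port and instance path as placeholders rather than the real alert query.

    # list targets of the flapping job that Prometheus currently sees as down
    curl -sG 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=up{job="atlas_exporter"} == 0'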
[07:32:24] (PS1) Elukey: profile::analytics::refinery::job::refine: disable monitor [puppet] - https://gerrit.wikimedia.org/r/616198
[07:35:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:36:28] goood
[08:34:42] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[08:34:58] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:35:02] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton
[08:35:12] PROBLEM - Check size of conntrack table on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[08:35:58] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:36:34] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[08:37:06] PROBLEM - MD RAID on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:37:18] PROBLEM - dhclient process on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[08:38:50] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[08:49:14] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[08:51:42] PROBLEM - Disk space on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes2002&var-datasource=codfw+prometheus/ops
[08:52:18] PROBLEM - configured eth on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[08:55:42] PROBLEM - IPMI Sensor Status on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:55:58] PROBLEM - Host db1082 is DOWN: PING CRITICAL - Packet loss = 100%
[08:57:56] db1082 crashed last week (https://phabricator.wikimedia.org/T258336)
[09:01:06] PROBLEM - DPKG on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:01:26] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:03:58] PROBLEM - MariaDB Replica IO: s5 on db1124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1082.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1082.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:06:40] RECOVERY - Host db1082 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[09:08:36] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (RhinosF1) In UK time (UTC+1) today, it went down for 13 mins: > 09:55:58 PROBLEM - Host db1082 is DOWN: PING CRITICAL - Packet loss = 100% > 10:03:58 PROBLEM - MariaDB Replica IO: s5 on db1124 is CRITICAL...
[09:11:05] PROBLEM - MariaDB Replica IO: s5 #page on db1082 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:11:08] PROBLEM - MariaDB read only s5 on db1082 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:11:25] PROBLEM - mysqld processes #page on db1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:11:44] <_joe_> uhm I'm not getting paged though, is this the server already removed from prod?
[09:11:46] PROBLEM - MariaDB Replica Lag: s5 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1098.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:11:52] interesting, they say # p a g e, but don't page
[09:12:03] <_joe_> XioNoX: I guess it's the host that crashed last week
[09:12:17] <_joe_> so we probably have notifications disabled in icinga
[09:12:21] PROBLEM - MariaDB Replica SQL: s5 #page on db1082 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:12:24] ok!
[09:12:27] good :)
[09:12:46] _joe_: https://phabricator.wikimedia.org/T258336
[09:13:02] It's same one
[09:13:54] <_joe_> XioNoX: actually I think we need to depool it
[09:13:58] <_joe_> I'm going to do it
[09:14:11] <_joe_> can you ping manuel?
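The "MariaDB Replica IO/SQL" and "Replica Lag" alerts above report the state of the replication threads on db1124 and db1082. The underlying state can be inspected directly on the affected replica; a sketch of the usual manual check (not the Icinga plugin itself):

    # on the replica: show replication thread state, lag and last error
    sudo mysql -e 'SHOW SLAVE STATUS\G' | \
      grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_IO_Error'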
[09:14:17] _joe_: yep
[09:16:00] RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:16:07] _joe_: sent him a message on IRC and whatsapp, let me know if I should call
[09:16:17] !log oblivian@cumin1001 dbctl commit (dc=all): 'Depool db1082 T258336', diff saved to https://phabricator.wikimedia.org/P12040 and previous config saved to /var/cache/conftool/dbconfig/20200725-091616-oblivian.json
[09:16:20] RECOVERY - Check size of conntrack table on kubernetes2002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[09:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:23] T258336: db1082 crashed - https://phabricator.wikimedia.org/T258336
[09:17:05] hey, i'm around
[09:17:06] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:28] * volans|off here
[09:18:21] the same host crashed a week ago
[09:18:24] let's depool it
[09:18:34] and leave it for later/monday
[09:18:37] RECOVERY - MariaDB Replica IO: s5 #page on db1082 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:18:38] volans: I think joe just depooled it
[09:18:42] RECOVERY - MariaDB read only s5 on db1082 is OK: Version 10.1.44-MariaDB, Uptime 62s, read_only: True, event_scheduler: True, 1500.47 QPS, connection latency: 0.004201s, query latency: 0.001055s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:18:46] I can review with kormat the weights of the remaining
[09:18:48] hosts
[09:18:50] <_joe_> volans: already depooled it
[09:18:53] thx
[09:18:57] RECOVERY - mysqld processes #page on db1082 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:19:20] <_joe_> ok I know why it didn't page
[09:19:26] the only thing is that db1124 replicates from it
[09:19:32] <_joe_> there was still an old incident opened and acked
[09:19:40] :/
[09:19:49] I got the irc notification only too fwiw
[09:19:55] RECOVERY - MariaDB Replica SQL: s5 #page on db1082 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:19:56] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: / (spec from root) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[09:20:10] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes2002 is OK: OK: synced at Sat 2020-07-25 09:20:08 UTC. https://wikitech.wikimedia.org/wiki/NTP
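The dbctl commit logged at 09:16:17 is the tail end of the standard depool flow. A sketch of the usual invocation, run from a cumin host (not a verbatim replay of what was typed here):

    # mark the instance as depooled, then commit the new dbconfig everywhere
    dbctl instance db1082 depool
    dbctl config commit -m 'Depool db1082 T258336'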
[09:20:12] i've started mariadb, and it's catching up on replication
[09:20:25] <_joe_> kormat: aye
[09:20:30] <_joe_> I'd leave it depooled
[09:20:35] +1
[09:20:36] RECOVERY - MD RAID on kubernetes2002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:20:56] RECOVERY - MariaDB Replica IO: s5 on db1124 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:21:12] RECOVERY - MariaDB Replica Lag: s5 on db1124 is OK: OK slave_sql_lag Replication lag: 0.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:21:17] <_joe_> kormat: do we need to add a server to the api pool for s5?
[09:21:28] Operations, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Kormat)
[09:21:30] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Kormat) Resolved→Open Re-opening to track the latest crash.
[09:21:33] Operations, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Kormat)
[09:21:42] <_joe_> it currently has just 1 server
[09:21:43] yeah I got the recovery via victorops
[09:21:48] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[09:21:48] bbu again ?
[09:21:50] <_joe_> I think we do
[09:22:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:22:17] <_joe_> marostegui: not sure, I'm trying first to solve the "we have only one server for api in s5" problem
[09:22:19] did the host reboot?
[09:22:25] marostegui: yes
[09:22:28] I'm flying
[09:22:36] i'll check the hw logs
[09:22:39] <_joe_> yes
[09:22:49] <_joe_> marostegui: ahahah then don't worry :P
[09:23:02] marostegui: yes
[09:23:02] s5 should be fine with just 1api
[09:23:06] for a few hours
[09:23:08] I've the hw logs in front
[09:23:10] RECOVERY - configured eth on kubernetes2002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[09:23:23] <_joe_> marostegui: ok, what if that server breaks too?
[09:23:27] do we have a new bug or can I just reoenlast week one?
[09:23:35] mw will just pick any of the others
[09:23:35] volans: i reopened the old one
[09:23:42] thx
[09:23:46] <_joe_> marostegui: ack thanks
[09:23:52] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Kormat) ` /system1/log1/record18 Targets Properties number=18 severity=Caution date=07/25/2020 time=08:53 description=Smart Storage Battery has exceeded the maximum amount of devices supported (Battery 1,...
[09:23:58] it is bbu issues with those old ho hosts
[09:24:06] hp
[09:24:11] just leave it deppoled
[09:24:18] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Volans) RElated hw logs for this new crash looks the same of last week: ` hpiLO-> show /system1/log1/record19 status=0 status_tag=COMMAND COMPLETED Sat Jul 25 09:22:56 2020 /system1/log1/record19 Targets Properties...
[09:24:19] marostegui: hw logs in the tasks, seems the same to me
[09:24:19] I will get to it when I land
[09:24:44] rotfl
[09:24:49] what a team
[09:25:07] <_joe_> yeah please mind the plane :D
[09:25:12] hahahaha
[09:25:14] epic
[09:25:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:25:59] FYI [ERROR] mysqld: Table './mysql/event' is marked as crashed and should be repaired
[09:26:34] RECOVERY - IPMI Sensor Status on kubernetes2002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:27:42] volans thatsok
[09:28:35] we need to get rid of those hps it was scheduled for q2 but we should accelerate that, will talk to mark next week
[09:29:09] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Volans) For reference mysqld start logs: ` Jul 25 09:17:37 db1082 mysqld[3356]: 2020-07-25 9:17:37 139825736157440 [Note] /opt/wmf-mariadb101/bin/mysqld (mysqld 10.1.44-MariaDB) starting as process 3356 ... Jul 25 09:17:37 db10...
[09:29:52] roger
[09:32:00] RECOVERY - DPKG on kubernetes2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:32:20] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:33:28] RECOVERY - Disk space on kubernetes2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes2002&var-datasource=codfw+prometheus/ops
[09:39:02] RECOVERY - dhclient process on kubernetes2002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[09:45:06] won't the crash of db1082 cause lag on labsdb s5, like last week?
[09:47:39] Operations, Services, Service-deployment-requests, artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (Aklapper) @Chtnnh: See https://phabricator.wikimedia.org/project/profile/1305/ ; apart from that see the point persons that you listed, I'd say?
[09:57:10] Operations, Services, Service-deployment-requests, artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (Chtnnh) I meant someone from the Service deployment team, so that it can be on their radar. T214201 is waiting on this task.
[09:58:13] <_joe_> Wiki13: the server is up and has 0 lag
[09:58:53] well i saw replag climbing on s5 when looking at replag.toolforge.org, hence i was asking :)
[10:00:04] <_joe_> ok I'm just looking at sanitarium, not labsdb
[10:00:34] <_joe_> so maybe the problem is there? but I'd be surprised
[10:02:25] <_joe_> Wiki13: uhm I don't see high lag for s5 on replag, am I missing something? I'm not familiar with the tool
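replag.toolforge.org reports per-shard replication lag on the Wiki Replicas. The same numbers can be read straight from the heartbeat_p view the replicas expose; a sketch, with the host left as a placeholder for one of the wiki replica endpoints:

    # per-shard lag, in seconds, as exposed on the wiki replicas
    mysql -h "$WIKI_REPLICA_HOST" -e 'SELECT shard, lag FROM heartbeat_p.heartbeat ORDER BY lag DESC;'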
[10:02:59] <_joe_> it seems to depend on when I refresh the page, which is weird
[10:03:02] the tool is a bit weird, sometimes its 0 and sometimes not
[10:03:44] guessing its probably multiple servers where one has no lag at all and the other one that was replicating from 1082 causes a high value
[10:04:23] causing errors to be given in several tools that show replag warnings
[10:05:22] <_joe_> to be clear, db1082 is up and sanitarium is replicating correctly
[10:05:38] <_joe_> I don't know enough about labs replicas, I'll need to read the docs
[10:05:44] <_joe_> kormat: do you know anything more?
[10:07:45] FYI, I was refering to https://sal.toolforge.org/log/ihmzY3MBj_Bg1xd3cEYk
[10:08:28] <_joe_> that was because the db actually crashed and was recovering
[10:08:33] <_joe_> now it's back up
[10:10:20] Ok. Was just verifying if that was applicable to this incident aswell, it seems not. I have no further questions now
[10:10:59] _joe_: it should be fine, but I'll take a look
[10:11:45] <_joe_> ok it's labsdb1009.eqiad.wmnet
[10:11:52] <_joe_> it has high lag for all shards
[10:12:08] <_joe_> Wiki13: thanks for reporting though
[10:12:11] that host has been on the struggle bus this week
[10:12:26] <_joe_> it wasn't related but it looks like an issue
[10:12:28] ah I see, that explains it
[10:12:39] m.anuel has been keeping an eye on it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/615725
[10:13:09] <_joe_> ah ok, yeah it seems like a good idea maybe
[10:13:30] <_joe_> Wiki13: it will be all shards, sadly, not just s5
[10:14:24] well, guess then we have to live with some lag a bit longer on toolforge until it clears up
[10:14:35] thanks for the info :)
[10:15:17] yeah all shards on labsdb1009 are on average about 5000s behind
[10:33:02] (CR) DannyS712: [C: +1] Remove bogus $wgWMEPhp7SamplingRate setting [mediawiki-config] - https://gerrit.wikimedia.org/r/609494 (https://phabricator.wikimedia.org/T219127) (owner: Krinkle)
[10:42:59] (CR) Krinkle: [C: +1] "Interesting. So messages from MW start with "ERR" and became "ERROR" and from php7-fatal-handler, they came in as "err" and became ERR?" [puppet] - https://gerrit.wikimedia.org/r/616116 (https://phabricator.wikimedia.org/T248181) (owner: Herron)
[10:46:50] PROBLEM - SSH on stat1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:54:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:55:20] Operations, Services, Service-deployment-requests, artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (Aklapper) @Chtnnh: See members listed on https://phabricator.wikimedia.org/project/profile/1305/
[10:55:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:56:10] RECOVERY - SSH on stat1008 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:10:42] Operations, Wikimedia-General-or-Unknown, Wikimedia-SVG-rendering, Documentation: Document how to request installing additional SVG and PDF fonts on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (Aklapper)
[11:28:08] PROBLEM - SSH on stat1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:29:54] RECOVERY - SSH on stat1008 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:30:54] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 57 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:33:16] Operations, Services, Service-deployment-requests, artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (Chtnnh) Thank you @Aklapper @akosiaris Hey! Chaitanya here, a volunteer from the scoring platform team. Need your help deploying this service...
[11:36:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 44 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:03:46] PROBLEM - SSH on stat1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:07:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:10:52] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:16:50] RECOVERY - SSH on stat1008 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:34:42] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Marostegui) Interestingly the OS still sees the BBU (which is something we've seen sometimes too). ` Battery/Capacitor Count: 1 ` This means that the above crash can still happen again. It would be better if the BBU would be...
[12:34:52] Operations, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) p:Medium→High
[12:35:18] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:39:04] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:41:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1096:3315 into s5 api afte db1082 crashed T258336', diff saved to https://phabricator.wikimedia.org/P12041 and previous config saved to /var/cache/conftool/dbconfig/20200725-124104-marostegui.json
[12:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:12] T258336: db1082 crashed - https://phabricator.wikimedia.org/T258336
[17:55:51] (PS1) Privacybatm: [POC4 WIP] transferpy: Multiprocess the transfers [software/transferpy] - https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T257601)
[17:57:39] (PS6) Privacybatm: Transferer.py: Add proper cleanup [software/transferpy] - https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450)
[18:20:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:21:58] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:05:35] (PS2) Privacybatm: [POC4 WIP] transferpy: Multiprocess the transfers [software/transferpy] - https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T257601)
[19:11:54] (PS8) Privacybatm: Firewall.py: Save the target port after reservation [software/transferpy] - https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601)
[19:12:26] (PS3) Privacybatm: [POC4 WIP] transferpy: Multiprocess the transfers [software/transferpy] - https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T257601)
[20:05:28] PROBLEM - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops
[20:28:00] RECOVERY - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops
[21:03:13] Operations, Readers-Web-Backlog, WMF-Legal, SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (Ferdi2005) @ovasileva it.Wikinews is really poorly indicizated, an article "Apple passa ad ARM e annuncia altre n...
[23:48:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:49:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets