[00:06:03] (CR) Dzahn: [C: +2] aphlict: add envoy for TLS termination [puppet] - https://gerrit.wikimedia.org/r/616184 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[00:32:21] (PS2) Dave Pifke: [WIP] arclamp: run arclamp-compress-logs [puppet] - https://gerrit.wikimedia.org/r/616179 (https://phabricator.wikimedia.org/T235456)
[00:35:55] Operations, Arc-Lamp, Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (Dzahn) Dave Pifke zipped some files and is working on a patch to make xhgui support gzipped files.
[00:42:39] !log Manually compressing some more data on webperf1002, using arclamp-compress-logs from https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/615904.
[00:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:34] dpifke: i see the disks i tried to create have been created so there is one 100GB and one 20GB disk assigned to webperf1002 and not mounted. but those would not be useful so i remove them again?
[00:43:56] Yeah, at this point I think we're almost good to go with gzip.
[00:44:13] ok, that's great, thx
[00:46:30] !log ganeti - removing disk 3 (20G) from webperf1002. the disks are 0-indexed, so the ones actually mounted are 0 (50G) and 1 (300G) (T257931)
[00:46:41] looks like removing one takes just as long as creating it ...
[00:46:52] Fun. :)
[00:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:42] T257931: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931
[00:50:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:40] (CR) Dave Pifke: "Puppet compiler output: https://puppet-compiler.wmflabs.org/compiler1001/24135/" [puppet] - https://gerrit.wikimedia.org/r/613359 (https://phabricator.wikimedia.org/T256035) (owner: Dave Pifke)
[01:08:36] RECOVERY - Disk space on webperf1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops
[01:52:47] !log ganeti - also removing (unmounted) disk 2 (100G) from webperf1002. T257931
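The webperf1002 cleanup above involves two kinds of commands: compressing old Arc Lamp logs by hand, and dropping the unused Ganeti disks again. A minimal sketch of both follows, assuming standard find/gzip and gnt-instance usage; the log path is illustrative and not taken from the log, and the exact flags actually run may have differed.

    # compress Arc Lamp logs older than 30 days to free space under /srv
    # (roughly what the arclamp-compress-logs script in the linked change automates;
    # /srv/xenon/logs is an assumed path for illustration)
    find /srv/xenon/logs -name '*.log' -mtime +30 -print -exec gzip {} \;

    # drop the unneeded Ganeti disks again; disks are 0-indexed, so 3 is the 20G
    # one and 2 the 100G one (run on the Ganeti master for the cluster)
    sudo gnt-instance modify --disk 3:remove webperf1002.eqiad.wmnet
    sudo gnt-instance modify --disk 2:remove webperf1002.eqiad.wmnet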
T257931 [01:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:54] T257931: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 [03:39:22] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [03:41:16] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [04:52:58] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [04:54:52] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [06:11:26] (03CR) 10Alexandros Kosiaris: api-gateway: Basic envoy chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [06:39:54] 10Operations, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Chtnnh) @Aklapper anyone I can reach out to for this task? [06:48:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:54:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200725T0700) [07:07:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:09:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
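The recurring "Prometheus jobs reduced availability" alerts above fire when some targets of a scrape job stop reporting. One quick way to see which targets are affected is to ask Prometheus for the job's up metric; a sketch only, using the standard HTTP query API, with host, port and instance path as placeholders rather than the real alert query.

    # list targets of the flapping job that Prometheus currently sees as down
    curl -sG 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=up{job="atlas_exporter"} == 0'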
[07:32:24] (PS1) Elukey: profile::analytics::refinery::job::refine: disable monitor [puppet] - https://gerrit.wikimedia.org/r/616198
[07:35:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:36:28] goood
[08:34:42] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[08:34:58] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:35:02] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton
[08:35:12] PROBLEM - Check size of conntrack table on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[08:35:58] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:36:34] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[08:37:06] PROBLEM - MD RAID on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:37:18] PROBLEM - dhclient process on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[08:38:50] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[08:49:14] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[08:51:42] PROBLEM - Disk space on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes2002&var-datasource=codfw+prometheus/ops
[08:52:18] PROBLEM - configured eth on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[08:55:42] PROBLEM - IPMI Sensor Status on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:55:58] PROBLEM - Host db1082 is DOWN: PING CRITICAL - Packet loss = 100%
[08:57:56] db1082 crashed last week (https://phabricator.wikimedia.org/T258336)
[09:01:06] PROBLEM - DPKG on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:01:26] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:03:58] PROBLEM - MariaDB Replica IO: s5 on db1124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1082.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1082.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:06:40] RECOVERY - Host db1082 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[09:08:36] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (RhinosF1) In UK time (UTC+1) today, it went down for 13 mins: > 09:55:58 PROBLEM - Host db1082 is DOWN: PING CRITICAL - Packet loss = 100% > 10:03:58 PROBLEM - MariaDB Replica IO: s5 on db1124 is CRITICAL...
[09:11:05] PROBLEM - MariaDB Replica IO: s5 #page on db1082 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:11:08] PROBLEM - MariaDB read only s5 on db1082 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:11:25] PROBLEM - mysqld processes #page on db1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:11:44] <_joe_> uhm I'm not getting paged though, is this the server already removed from prod?
[09:11:46] PROBLEM - MariaDB Replica Lag: s5 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1098.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:11:52] interesting, they say # p a g e, but don't page
[09:12:03] <_joe_> XioNoX: I guess it's the host that crashed last week
[09:12:17] <_joe_> so we probably have notifications disabled in icinga
[09:12:21] PROBLEM - MariaDB Replica SQL: s5 #page on db1082 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:12:24] ok!
[09:12:27] good :)
[09:12:46] _joe_: https://phabricator.wikimedia.org/T258336
[09:13:02] It's same one
[09:13:54] <_joe_> XioNoX: actually I think we need to depool it
[09:13:58] <_joe_> I'm going to do it
[09:14:11] <_joe_> can you ping manuel?
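The "MariaDB Replica IO/SQL" and "Replica Lag" alerts above report the state of the replication threads on db1124 and db1082. The underlying state can be inspected directly on the affected replica; a sketch of the usual manual check (not the Icinga plugin itself):

    # on the replica: show replication thread state, lag and last error
    sudo mysql -e 'SHOW SLAVE STATUS\G' | \
      grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_IO_Error'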
[09:14:17] _joe_: yep
[09:16:00] RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:16:07] _joe_: sent him a message on IRC and whatsapp, let me know if I should call
[09:16:17] !log oblivian@cumin1001 dbctl commit (dc=all): 'Depool db1082 T258336', diff saved to https://phabricator.wikimedia.org/P12040 and previous config saved to /var/cache/conftool/dbconfig/20200725-091616-oblivian.json
[09:16:20] RECOVERY - Check size of conntrack table on kubernetes2002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[09:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:23] T258336: db1082 crashed - https://phabricator.wikimedia.org/T258336
[09:17:05] hey, i'm around
[09:17:06] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:28] * volans|off here
[09:18:21] the same host crashed a week ago
[09:18:24] let's depool it
[09:18:34] and leave it for later/monday
[09:18:37] RECOVERY - MariaDB Replica IO: s5 #page on db1082 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:18:38] volans: I think joe just depooled it
[09:18:42] RECOVERY - MariaDB read only s5 on db1082 is OK: Version 10.1.44-MariaDB, Uptime 62s, read_only: True, event_scheduler: True, 1500.47 QPS, connection latency: 0.004201s, query latency: 0.001055s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[09:18:46] I can review with kormat the weights of the remaining
[09:18:48] hosts
[09:18:50] <_joe_> volans: already depooled it
[09:18:53] thx
[09:18:57] RECOVERY - mysqld processes #page on db1082 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:19:20] <_joe_> ok I know why it didn't page
[09:19:26] the only thing is that db1124 replicates from it
[09:19:32] <_joe_> there was still an old incident opened and acked
[09:19:40] :/
[09:19:49] I got the irc notification only too fwiw
[09:19:55] RECOVERY - MariaDB Replica SQL: s5 #page on db1082 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:19:56] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: / (spec from root) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[09:20:10] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes2002 is OK: OK: synced at Sat 2020-07-25 09:20:08 UTC. https://wikitech.wikimedia.org/wiki/NTP
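The dbctl commit logged at 09:16:17 is the tail end of the standard depool flow. A sketch of the usual invocation, run from a cumin host (not a verbatim replay of what was typed here):

    # mark the instance as depooled, then commit the new dbconfig everywhere
    dbctl instance db1082 depool
    dbctl config commit -m 'Depool db1082 T258336'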
[09:20:12] i've started mariadb, and it's catching up on replication
[09:20:25] <_joe_> kormat: aye
[09:20:30] <_joe_> I'd leave it depooled
[09:20:35] +1
[09:20:36] RECOVERY - MD RAID on kubernetes2002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:20:56] RECOVERY - MariaDB Replica IO: s5 on db1124 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:21:12] RECOVERY - MariaDB Replica Lag: s5 on db1124 is OK: OK slave_sql_lag Replication lag: 0.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:21:17] <_joe_> kormat: do we need to add a server to the api pool for s5?
[09:21:28] Operations, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Kormat)
[09:21:30] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Kormat) Resolved→Open Re-opening to track the latest crash.
[09:21:33] Operations, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Kormat)
[09:21:42] <_joe_> it currently has just 1 server
[09:21:43] yeah I got the recovery via victorops
[09:21:48] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[09:21:48] bbu again ?
[09:21:50] <_joe_> I think we do
[09:22:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:22:17] <_joe_> marostegui: not sure, I'm trying first to solve the "we have only one server for api in s5" problem
[09:22:19] did the host reboot?
[09:22:25] marostegui: yes
[09:22:28] I'm flying
[09:22:36] i'll check the hw logs
[09:22:39] <_joe_> yes
[09:22:49] <_joe_> marostegui: ahahah then don't worry :P
[09:23:02] marostegui: yes
[09:23:02] s5 should be fine with just 1api
[09:23:06] for a few hours
[09:23:08] I've the hw logs in front
[09:23:10] RECOVERY - configured eth on kubernetes2002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[09:23:23] <_joe_> marostegui: ok, what if that server breaks too?
[09:23:27] do we have a new bug or can I just reoenlast week one?
[09:23:35] mw will just pick any of the others
[09:23:35] volans: i reopened the old one
[09:23:42] thx
[09:23:46] <_joe_> marostegui: ack thanks
[09:23:52] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Kormat) ` /system1/log1/record18 Targets Properties number=18 severity=Caution date=07/25/2020 time=08:53 description=Smart Storage Battery has exceeded the maximum amount of devices supported (Battery 1,...
[09:23:58] it is bbu issues with those old ho hosts
[09:24:06] hp
[09:24:11] just leave it deppoled
[09:24:18] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Volans) RElated hw logs for this new crash looks the same of last week: ` hpiLO-> show /system1/log1/record19 status=0 status_tag=COMMAND COMPLETED Sat Jul 25 09:22:56 2020 /system1/log1/record19 Targets Properties...
[09:24:19] marostegui: hw logs in the tasks, seems the same to me
[09:24:19] I will get to it when I land
[09:24:44] rotfl
[09:24:49] what a team
[09:25:07] <_joe_> yeah please mind the plane :D
[09:25:12] hahahaha
[09:25:14] epic
[09:25:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:25:59] FYI [ERROR] mysqld: Table './mysql/event' is marked as crashed and should be repaired
[09:26:34] RECOVERY - IPMI Sensor Status on kubernetes2002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:27:42] volans thatsok
[09:28:35] we need to get rid of those hps it was scheduled for q2 but we should accelerate that, will talk to mark next week
[09:29:09] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Volans) For reference mysqld start logs: ` Jul 25 09:17:37 db1082 mysqld[3356]: 2020-07-25 9:17:37 139825736157440 [Note] /opt/wmf-mariadb101/bin/mysqld (mysqld 10.1.44-MariaDB) starting as process 3356 ... Jul 25 09:17:37 db10...
[09:29:52] roger
[09:32:00] RECOVERY - DPKG on kubernetes2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:32:20] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:33:28] RECOVERY - Disk space on kubernetes2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes2002&var-datasource=codfw+prometheus/ops
[09:39:02] RECOVERY - dhclient process on kubernetes2002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[09:45:06] won't the crash of db1082 cause lag on labsdb s5, like last week?
[09:47:39] Operations, Services, Service-deployment-requests, artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (Aklapper) @Chtnnh: See https://phabricator.wikimedia.org/project/profile/1305/ ; apart from that see the point persons that you listed, I'd say?
[09:57:10] Operations, Services, Service-deployment-requests, artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (Chtnnh) I meant someone from the Service deployment team, so that it can be on their radar. T214201 is waiting on this task.
[09:58:13] <_joe_> Wiki13: the server is up and has 0 lag
[09:58:53] well i saw replag climbing on s5 when looking at replag.toolforge.org, hence i was asking :)
[10:00:04] <_joe_> ok I'm just looking at sanitarium, not labsdb
[10:00:34] <_joe_> so maybe the problem is there? but I'd be surprised
[10:02:25] <_joe_> Wiki13: uhm I don't see high lag for s5 on replag, am I missing something? I'm not familiar with the tool
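replag.toolforge.org reports per-shard replication lag on the Wiki Replicas. The same numbers can be read straight from the heartbeat_p view the replicas expose; a sketch, with the host left as a placeholder for one of the wiki replica endpoints:

    # per-shard lag, in seconds, as exposed on the wiki replicas
    mysql -h "$WIKI_REPLICA_HOST" -e 'SELECT shard, lag FROM heartbeat_p.heartbeat ORDER BY lag DESC;'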
[10:02:59] <_joe_> it seems to depend on when I refresh the page, which is weird
[10:03:02] the tool is a bit weird, sometimes its 0 and sometimes not
[10:03:44] guessing its probably multiple servers where one has no lag at all and the other one that was replicating from 1082 causes a high value
[10:04:23] causing errors to be given in several tools that show replag warnings
[10:05:22] <_joe_> to be clear, db1082 is up and sanitarium is replicating correctly
[10:05:38] <_joe_> I don't know enough about labs replicas, I'll need to read the docs
[10:05:44] <_joe_> kormat: do you know anything more?
[10:07:45] FYI, I was refering to https://sal.toolforge.org/log/ihmzY3MBj_Bg1xd3cEYk
[10:08:28] <_joe_> that was because the db actually crashed and was recovering
[10:08:33] <_joe_> now it's back up
[10:10:20] Ok. Was just verifying if that was applicable to this incident aswell, it seems not. I have no further questions now
[10:10:59] _joe_: it should be fine, but I'll take a look
[10:11:45] <_joe_> ok it's labsdb1009.eqiad.wmnet
[10:11:52] <_joe_> it has high lag for all shards
[10:12:08] <_joe_> Wiki13: thanks for reporting though
[10:12:11] that host has been on the struggle bus this week
[10:12:26] <_joe_> it wasn't related but it looks like an issue
[10:12:28] ah I see, that explains it
[10:12:39] m.anuel has been keeping an eye on it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/615725
[10:13:09] <_joe_> ah ok, yeah it seems like a good idea maybe
[10:13:30] <_joe_> Wiki13: it will be all shards, sadly, not just s5
[10:14:24] well, guess then we have to live with some lag a bit longer on toolforge until it clears up
[10:14:35] thanks for the info :)
[10:15:17] yeah all shards on labsdb1009 are on average about 5000s behind
[10:33:02] (CR) DannyS712: [C: +1] Remove bogus $wgWMEPhp7SamplingRate setting [mediawiki-config] - https://gerrit.wikimedia.org/r/609494 (https://phabricator.wikimedia.org/T219127) (owner: Krinkle)
[10:42:59] (CR) Krinkle: [C: +1] "Interesting. So messages from MW start with "ERR" and became "ERROR" and from php7-fatal-handler, they came in as "err" and became ERR?" [puppet] - https://gerrit.wikimedia.org/r/616116 (https://phabricator.wikimedia.org/T248181) (owner: Herron)
[10:46:50] PROBLEM - SSH on stat1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:54:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:55:20] Operations, Services, Service-deployment-requests, artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (Aklapper) @Chtnnh: See members listed on https://phabricator.wikimedia.org/project/profile/1305/
[10:55:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:56:10] RECOVERY - SSH on stat1008 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:10:42] Operations, Wikimedia-General-or-Unknown, Wikimedia-SVG-rendering, Documentation: Document how to request installing additional SVG and PDF fonts on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (Aklapper)
[11:28:08] PROBLEM - SSH on stat1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:29:54] RECOVERY - SSH on stat1008 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:30:54] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 57 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:33:16] Operations, Services, Service-deployment-requests, artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (Chtnnh) Thank you @Aklapper @akosiaris Hey! Chaitanya here, a volunteer from the scoring platform team. Need your help deploying this service...
[11:36:46] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 44 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:03:46] PROBLEM - SSH on stat1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:07:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:10:52] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:16:50] RECOVERY - SSH on stat1008 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:34:42] Operations, DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (Marostegui) Interestingly the OS still sees the BBU (which is something we've seen sometimes too). ` Battery/Capacitor Count: 1 ` This means that the above crash can still happen again. It would be better if the BBU would be...
[12:34:52] Operations, DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) p:Medium→High
[12:35:18] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:39:04] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:41:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1096:3315 into s5 api afte db1082 crashed T258336', diff saved to https://phabricator.wikimedia.org/P12041 and previous config saved to /var/cache/conftool/dbconfig/20200725-124104-marostegui.json
[12:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:12] T258336: db1082 crashed - https://phabricator.wikimedia.org/T258336
[17:55:51] (PS1) Privacybatm: [POC4 WIP] transferpy: Multiprocess the transfers [software/transferpy] - https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T257601)
[17:57:39] (PS6) Privacybatm: Transferer.py: Add proper cleanup [software/transferpy] - https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450)
[18:20:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:21:58] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:05:35] (PS2) Privacybatm: [POC4 WIP] transferpy: Multiprocess the transfers [software/transferpy] - https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T257601)
[19:11:54] (PS8) Privacybatm: Firewall.py: Save the target port after reservation [software/transferpy] - https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601)
[19:12:26] (PS3) Privacybatm: [POC4 WIP] transferpy: Multiprocess the transfers [software/transferpy] - https://gerrit.wikimedia.org/r/616282 (https://phabricator.wikimedia.org/T257601)
[20:05:28] PROBLEM - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops
[20:28:00] RECOVERY - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops
[21:03:13] Operations, Readers-Web-Backlog, WMF-Legal, SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (Ferdi2005) @ovasileva it.Wikinews is really poorly indicizated, an article "Apple passa ad ARM e annuncia altre n...
[23:48:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:49:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets