[00:00:04] Deploy window Veteran's Day (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191111T0000)
[00:15:35] PROBLEM - Check systemd state on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:45] PROBLEM - Check size of conntrack table on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[00:16:13] PROBLEM - SSH on analytics-tool1001 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:16:17] PROBLEM - DPKG on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[00:16:21] PROBLEM - configured eth on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[00:16:29] PROBLEM - Check whether ferm is active by checking the default input chain on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[00:16:55] PROBLEM - Disk space on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics-tool1001&var-datasource=eqiad+prometheus/ops
[00:17:05] PROBLEM - dhclient process on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[00:17:05] PROBLEM - puppet last run on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:28:36] Operations, Traffic, wikitech.wikimedia.org: Wikitech page views sometimes default to MobileFrontend - https://phabricator.wikimedia.org/T220567 (bd808)
[00:33:43] PROBLEM - Check the NTP synchronisation status of timesyncd on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[00:40:15] RECOVERY - SSH on analytics-tool1001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:03:33] RECOVERY - Check size of conntrack table on analytics-tool1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[01:04:03] RECOVERY - DPKG on analytics-tool1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[01:04:07] RECOVERY - configured eth on analytics-tool1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[01:04:17] RECOVERY - Check whether ferm is active by checking the default input chain on analytics-tool1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:04:19] RECOVERY - Check the NTP synchronisation status of timesyncd on analytics-tool1001 is OK: OK: synced at Mon 2019-11-11 01:04:16 UTC. https://wikitech.wikimedia.org/wiki/NTP
[01:04:39] RECOVERY - Disk space on analytics-tool1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics-tool1001&var-datasource=eqiad+prometheus/ops
[01:04:51] RECOVERY - dhclient process on analytics-tool1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[01:04:57] RECOVERY - Check systemd state on analytics-tool1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:07:23] RECOVERY - puppet last run on analytics-tool1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:31:47] PROBLEM - Host backup2001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:34:03] RECOVERY - Host backup2001 is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms
[02:30:08] (PS7) CRusnov: Add script to generate DNS records from Netbox [software/netbox-deploy] - https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183)
[02:31:22] (CR) CRusnov: "> Patch Set 6: Code-Review-1" (4 comments) [software/netbox-deploy] - https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[02:32:44] (CR) CRusnov: "> Patch Set 7:" (4 comments) [software/netbox-deploy] - https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[02:49:50] (PS1) CRusnov: coherence: Alert on ACTIVE devices with names future- or spare. [software/netbox-reports] - https://gerrit.wikimedia.org/r/550051 (https://phabricator.wikimedia.org/T237464)
[03:55:29] (PS1) CRusnov: cables: detect duplicate cable names, and blank cable names [software/netbox-reports] - https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007)
[04:05:51] (PS1) CRusnov: hieradata/netbox: Add accounting report to alerts [puppet] - https://gerrit.wikimedia.org/r/550053
[04:25:33] (PS1) Reedy: Remove wgTorLoadNodes as it was removed in b5ccbe [mediawiki-config] - https://gerrit.wikimedia.org/r/550055
[04:31:18] (PS1) Reedy: Remove cron tor_exit_node_update from profile::openstack::codfw1dev::wikitech::web [puppet] - https://gerrit.wikimedia.org/r/550057 (https://phabricator.wikimedia.org/T156733)
[04:32:53] (CR) Reedy: Remove cron tor_exit_node_update from profile::openstack::codfw1dev::wikitech::web (1 comment) [puppet] - https://gerrit.wikimedia.org/r/550057 (https://phabricator.wikimedia.org/T156733) (owner: Reedy)
[04:33:40] (CR) jerkins-bot: [V: -1] Remove cron tor_exit_node_update from profile::openstack::codfw1dev::wikitech::web [puppet] - https://gerrit.wikimedia.org/r/550057 (https://phabricator.wikimedia.org/T156733) (owner: Reedy)
[04:34:43] (PS2) Reedy: Remove tor_exit_node_update cron from wikitech::web [puppet] - https://gerrit.wikimedia.org/r/550057 (https://phabricator.wikimedia.org/T156733)
[05:21:23] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp3056 [puppet] - https://gerrit.wikimedia.org/r/550061 (https://phabricator.wikimedia.org/T231627)
[05:21:25] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp3056 [puppet] - https://gerrit.wikimedia.org/r/550062 (https://phabricator.wikimedia.org/T231627)
[05:27:34] (PS2) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp3058 [puppet] - https://gerrit.wikimedia.org/r/550061 (https://phabricator.wikimedia.org/T231627)
[05:27:36] (PS2) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp3058 [puppet] - https://gerrit.wikimedia.org/r/550062 (https://phabricator.wikimedia.org/T231627)
[05:27:57] !log Switch from nginx to ats-tls on cp3058 - T231627
[05:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:03] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[05:29:31] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp3058 [puppet] - https://gerrit.wikimedia.org/r/550061 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[05:31:24] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp3058 [puppet] - https://gerrit.wikimedia.org/r/550062 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[05:40:07] Operations, Traffic, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[06:12:45] Operations, ops-codfw, DBA: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (Marostegui) >>! In T237702#5647039, @jcrespo wrote: > @Papaul Yes, the rack proposal seems ok. > @Marostegui Let's consider installing buster on new hosts starting now, even if that means in...
[06:34:00] Operations, ops-codfw, DBA: Upgrade db2072 firmware and bios - https://phabricator.wikimedia.org/T237905 (Marostegui)
[06:34:13] Operations, ops-codfw, DBA: Upgrade db2072 firmware and bios - https://phabricator.wikimedia.org/T237905 (Marostegui) p:Triage→Normal
[06:44:20] !log Delete globalblocks table from napwikisource T230055
[06:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:26] T230055: Remove globalblocks tables from wikis - https://phabricator.wikimedia.org/T230055
[06:56:47] !log delete /etc/logrotate.d/wdqs-reload-categories from wdqs* as attempt to reduce cronspam
[06:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:12] gehel, onimisionipe --^ - o/, I think the above was duplicated, also not in puppet
[07:28:39] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui) >>! In T228258#5614366, @jcrespo wrote: > I noticed db2062 isn't set on m1, is that on purpose because it is going to be decommissioned? Or because we didn't want it to alert? Or something else? CC @Maros...
[07:33:05] (CR) Marostegui: [C: +1] mariadb package: Add 10.1.42 packages [software] - https://gerrit.wikimedia.org/r/548967 (owner: Jcrespo)
[08:03:42] (PS5) Ema: ATS: X-Wikimedia-Debug request routing implementation [puppet] - https://gerrit.wikimedia.org/r/549840 (https://phabricator.wikimedia.org/T237687)
[08:06:11] (CR) jerkins-bot: [V: -1] ATS: X-Wikimedia-Debug request routing implementation [puppet] - https://gerrit.wikimedia.org/r/549840 (https://phabricator.wikimedia.org/T237687) (owner: Ema)
[08:08:42] Operations, DBA: Decommission db2048.codfw.wmnet - https://phabricator.wikimedia.org/T237913 (Marostegui)
[08:09:02] Operations, DBA: Decommission db2048.codfw.wmnet - https://phabricator.wikimedia.org/T237913 (Marostegui) p:Triage→Normal
[08:12:30] elukey: thanks! looks like a leftover from our recent refactoring
[08:20:56] (PS1) Marostegui: mariadb: Set db2048 as spare [puppet] - https://gerrit.wikimedia.org/r/550089 (https://phabricator.wikimedia.org/T237913)
[08:21:09] !log Remove db2048 from tendril and zarcillo T237913
[08:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:15] T237913: Decommission db2048.codfw.wmnet - https://phabricator.wikimedia.org/T237913
[08:24:13] (CR) Filippo Giunchedi: [C: +1] prometheus: add lists server mtail scrape to mtail jobs [puppet] - https://gerrit.wikimedia.org/r/549179 (https://phabricator.wikimedia.org/T236505) (owner: Cwhite)
[08:24:33] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2048 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/550091 (https://phabricator.wikimedia.org/T237913)
[08:24:35] (PS1) ArielGlenn: enable rsync to dumpsdata1003 for all dumps servers [puppet] - https://gerrit.wikimedia.org/r/550090 (https://phabricator.wikimedia.org/T219768)
[08:24:41] (CR) Filippo Giunchedi: [C: +1] "Haven't tested the resulting metrics but LGTM" [puppet] - https://gerrit.wikimedia.org/r/548938 (https://phabricator.wikimedia.org/T233448) (owner: Cwhite)
[08:25:22] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2048 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/550091 (https://phabricator.wikimedia.org/T237913) (owner: Marostegui)
[08:26:06] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2048 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/550091 (https://phabricator.wikimedia.org/T237913) (owner: Marostegui)
[08:26:22] Operations, Wikimedia-Logstash, Privacy: Production logstash should be protected by two-factor auth, at the least - https://phabricator.wikimedia.org/T237630 (MoritzMuehlenhoff) p:Triage→Normal We're in the process of rolling out Apereo CAS (and initial services are getting migrated to it), s...
[08:27:23] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2048 from config T237913 (duration: 00m 54s)
[08:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:28] T237913: Decommission db2048.codfw.wmnet - https://phabricator.wikimedia.org/T237913
[08:27:34] (CR) Marostegui: [C: +2] mariadb: Set db2048 as spare [puppet] - https://gerrit.wikimedia.org/r/550089 (https://phabricator.wikimedia.org/T237913) (owner: Marostegui)
[08:27:48] Operations, wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (MoritzMuehlenhoff) This task misses a rationale, what do we need it for on the non-labweb mw* servers? Anything which will be rolled out in the future?
[08:27:54] Operations, wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (MoritzMuehlenhoff) p:Triage→Normal
[08:28:21] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2048 from config T237913 (duration: 00m 51s)
[08:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:38] !log Stop MySQL on db2048 before decommissioning - T237913
[08:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:55] (CR) ArielGlenn: [C: +2] enable rsync to dumpsdata1003 for all dumps servers [puppet] - https://gerrit.wikimedia.org/r/550090 (https://phabricator.wikimedia.org/T219768) (owner: ArielGlenn)
[08:29:03] Operations, DBA: Decommission db2048.codfw.wmnet - https://phabricator.wikimedia.org/T237913 (Marostegui)
[08:30:38] (CR) Filippo Giunchedi: [C: -1] "See inline, I'm +1 on the idea and getting this off the ground." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/510613 (owner: Jbond)
[08:31:04] (CR) Filippo Giunchedi: [C: +2] prometheus: add lists server mtail scrape to mtail jobs [puppet] - https://gerrit.wikimedia.org/r/549179 (https://phabricator.wikimedia.org/T236505) (owner: Cwhite)
[08:43:17] (PS1) KartikMistry: New upstream release [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/550092 (https://phabricator.wikimedia.org/T233697)
[08:43:28] (CR) jerkins-bot: [V: -1] New upstream release [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/550092 (https://phabricator.wikimedia.org/T233697) (owner: KartikMistry)
[08:43:49] Operations, ops-codfw, DBA: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (jcrespo) I really meant 10.1, as a stop-gap measure until a final decision for database on buster is done, but to prevent a second reimage later on (upgrading just the package or copying it...
[08:45:16] Operations, ops-codfw, DBA: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (Marostegui) >>! In T237702#5652295, @jcrespo wrote: > I really meant 10.1, as a stop-gap measure until a final decision for database on buster is done, but to prevent a second reimage later...
[08:47:44] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (jcrespo) What about db1135, then (probably others)? Is it on purpose or WIP, or something else? I didn't want to break anything, sorry.
[08:49:40] (CR) KartikMistry: "recheck" [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/550092 (https://phabricator.wikimedia.org/T233697) (owner: KartikMistry)
[08:49:51] Operations, ops-codfw, DBA: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (Marostegui) @Papaul remember that I already included those hosts into into their correct partman recipe and spare role for now: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/543748/...
[08:50:02] Operations, ops-codfw, DBA: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (jcrespo) So, any package would be cool to me- I just would like to avoid you extra work for an extra reimage.
[08:50:08] Operations, Wikimedia-Logstash, Privacy: Production logstash should be protected by two-factor auth, at the least - https://phabricator.wikimedia.org/T237630 (fgiunchedi) Indeed what @MoritzMuehlenhoff said, we'll gain 2FA when CAS gets deployed more widely. Regarding the first point @awight where wo...
[08:53:41] (CR) KartikMistry: "recheck" [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/550092 (https://phabricator.wikimedia.org/T233697) (owner: KartikMistry)
[08:55:30] Operations, User-jbond: Boron disk space alert - https://phabricator.wikimedia.org/T237649 (fgiunchedi) >>! In T237649#5646071, @Ottomata wrote: > Removed a buncha stuff! Ditto!
[09:00:09] (PS1) Filippo Giunchedi: monitoring: add alerts for ats availability [puppet] - https://gerrit.wikimedia.org/r/550094 (https://phabricator.wikimedia.org/T236482)
[09:08:01] (CR) Vgutierrez: [C: -1] "I don't know how the current thresholds are going to work with ats-tls, some slow/failed POSTS that would result on a 408 with nginx, now " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/550094 (https://phabricator.wikimedia.org/T236482) (owner: Filippo Giunchedi)
[09:09:23] !log volker-e@deploy1001 Started deploy [design/style-guide@0ea65f2]: Deploy design/style-guide:
[09:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:31] !log volker-e@deploy1001 Finished deploy [design/style-guide@0ea65f2]: Deploy design/style-guide: (duration: 00m 07s)
[09:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:19] Operations, wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (Krenair) Probably because this is a logical blocker to the parent task.
[09:13:10] (PS6) Ema: ATS: X-Wikimedia-Debug request routing implementation [puppet] - https://gerrit.wikimedia.org/r/549840 (https://phabricator.wikimedia.org/T237687)
[09:16:55] (PS7) Ema: ATS: X-Wikimedia-Debug request routing implementation [puppet] - https://gerrit.wikimedia.org/r/549840 (https://phabricator.wikimedia.org/T237687)
[09:19:28] (CR) Vgutierrez: [C: +1] "looking good :)" [puppet] - https://gerrit.wikimedia.org/r/549840 (https://phabricator.wikimedia.org/T237687) (owner: Ema)
[09:29:20] (PS8) Ema: ATS: X-Wikimedia-Debug request routing implementation [puppet] - https://gerrit.wikimedia.org/r/549840 (https://phabricator.wikimedia.org/T237687)
[09:32:56] (PS1) Elukey: profile::kerberos::client: add MOTD to help users [puppet] - https://gerrit.wikimedia.org/r/550099 (https://phabricator.wikimedia.org/T237269)
[09:33:55] (PS9) Ema: ATS: X-Wikimedia-Debug request routing implementation [puppet] - https://gerrit.wikimedia.org/r/549840 (https://phabricator.wikimedia.org/T237687)
[09:37:23] (CR) Muehlenhoff: [C: +1] "LGTM, once Kerberos is more mainstream, we can probably make the banner smaller :-)" [puppet] - https://gerrit.wikimedia.org/r/550099 (https://phabricator.wikimedia.org/T237269) (owner: Elukey)
[09:39:19] (CR) Elukey: [C: +2] profile::kerberos::client: add MOTD to help users [puppet] - https://gerrit.wikimedia.org/r/550099 (https://phabricator.wikimedia.org/T237269) (owner: Elukey)
[09:39:27] (CR) Ema: [C: +2] ATS: X-Wikimedia-Debug request routing implementation [puppet] - https://gerrit.wikimedia.org/r/549840 (https://phabricator.wikimedia.org/T237687) (owner: Ema)
[09:39:42] moritzm: we could also add the kerberos logo with asciiart! :D
[09:40:01] elukey: OK to merge your art?
[09:40:13] ema: <3
[09:41:15] !log test x-wikimedia-debug-routing.lua on cp4027 (depooled) T237687
[09:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:20] T237687: ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687
[09:42:30] elukey: or some old school Amiga/C64 graphics demo!
[09:42:45] :D
[09:46:53] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01385 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[09:47:24] it is probably due to me --^
[09:47:27] fixing now
[09:48:35] (PS1) Elukey: profile::kerberos::client: set up motd properly [puppet] - https://gerrit.wikimedia.org/r/550100
[09:52:40] (CR) Marostegui: "I assume this is the one that has been installed&tested on db1114?" [software] - https://gerrit.wikimedia.org/r/546455 (owner: Jcrespo)
[09:54:08] (PS1) Elukey: Add fake kerberos keytabs for stat nodes [labs/private] - https://gerrit.wikimedia.org/r/550101
[09:54:37] (CR) Elukey: [V: +2 C: +2] Add fake kerberos keytabs for stat nodes [labs/private] - https://gerrit.wikimedia.org/r/550101 (owner: Elukey)
[09:55:07] (PS1) Jcrespo: mariadb-package: Updates for mariadb 10.1.43 and percona-server 8.0.17 [software] - https://gerrit.wikimedia.org/r/550102
[09:55:44] (CR) Jcrespo: "There is a new version that got just uploaded." [software] - https://gerrit.wikimedia.org/r/546455 (owner: Jcrespo)
[09:56:41] nice win on not spamming puppet alerts <3
[09:57:46] (CR) Elukey: [C: +2] profile::kerberos::client: set up motd properly [puppet] - https://gerrit.wikimedia.org/r/550100 (owner: Elukey)
[09:58:06] godog: yep!
[10:03:08] (CR) Marostegui: [C: +1] mariadb-package: Updates for mariadb 10.1.43 and percona-server 8.0.17 [software] - https://gerrit.wikimedia.org/r/550102 (owner: Jcrespo)
[10:03:58] (PS1) Ema: ATS: skip the cache if X-Wikimedia-Debug is valid [puppet] - https://gerrit.wikimedia.org/r/550103 (https://phabricator.wikimedia.org/T237687)
[10:06:33] (CR) jerkins-bot: [V: -1] ATS: skip the cache if X-Wikimedia-Debug is valid [puppet] - https://gerrit.wikimedia.org/r/550103 (https://phabricator.wikimedia.org/T237687) (owner: Ema)
[10:07:16] (PS2) Ema: ATS: skip the cache if X-Wikimedia-Debug is valid [puppet] - https://gerrit.wikimedia.org/r/550103 (https://phabricator.wikimedia.org/T237687)
[10:07:23] Operations, cloud-services-team: Failing puppet runs on labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T235819 (jcrespo) labtestpuppetmaster2001 is failing because it doesn't have any full backup with > 0 bytes. I assume var-lib-puppet-volatile was empty at some point and then it got populat...
[10:10:41] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.002915 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:11:23] (03CR) 10Vgutierrez: [C: 03+1] ATS: skip the cache if X-Wikimedia-Debug is valid [puppet] - 10https://gerrit.wikimedia.org/r/550103 (https://phabricator.wikimedia.org/T237687) (owner: 10Ema) [10:11:53] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 95 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [10:12:14] !log manually run full backup of labtestpuppetmaster2001 T235819 [10:12:16] (03CR) 10Ema: [C: 03+2] ATS: skip the cache if X-Wikimedia-Debug is valid [puppet] - 10https://gerrit.wikimedia.org/r/550103 (https://phabricator.wikimedia.org/T237687) (owner: 10Ema) [10:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:19] T235819: Failing puppet runs on labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T235819 [10:16:26] !log repool cp4027 after successful X-Wikimedia-Debug testing P9585 T237687 [10:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:42] T237687: ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 [10:21:12] !log upgrade mariadb on db2102 [10:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:28] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 (10ema) 05Open→03Resolved The functionality is now deployed to production, a brief illustration follows. Valid XWD header: ` $ curl -s -v -H "X-Wik... [10:32:35] 10Operations, 10Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) - Base install is ready thanks to Chris. 
- Resized the lvm and the filesystem for /data so that's ready to go. - rsync running in screen on dumpsdata1003 pulling last two go... [10:32:48] !log restarting ats-tls on cp1088 [10:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:05] RECOVERY - traffic_server tls process restarted on cp1088 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqiad+prometheus/ops&var-instance=cp1088&var-layer=tls [10:37:17] 10Operations, 10Traffic: Remove debug proxies once all Varnish backends are gone - https://phabricator.wikimedia.org/T237932 (10ema) [10:37:23] 10Operations, 10Traffic: Remove debug proxies once all Varnish backends are gone - https://phabricator.wikimedia.org/T237932 (10ema) p:05Triage→03Normal [10:41:07] (03CR) 10Jcrespo: [C: 04-1] "There is some extra patch and discussion involved on this we should discuss, FYI. (this is just for package creation, nothing to do with m" [software] - 10https://gerrit.wikimedia.org/r/550102 (owner: 10Jcrespo) [10:45:07] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [10:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:33] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 (10ema) [10:46:37] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ema) [10:47:18] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [10:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:35] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [10:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:36] !log Updated the Wikidata property suggester with data from the 
2019-11-04 JSON dump and applied the T132839 workarounds [10:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:41] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [10:59:40] (03PS1) 10Ema: cache: reimage cp3050 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/550105 (https://phabricator.wikimedia.org/T227432) [10:59:43] (03PS1) 10Ema: cache_text esams: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/550106 (https://phabricator.wikimedia.org/T227432) [11:07:01] (03CR) 10Arturo Borrero Gonzalez: [C: 04-2] "We still have a couple of HW servers using mitaka packages (labmon, labstores). Also, many VMs too." [puppet] - 10https://gerrit.wikimedia.org/r/549814 (owner: 10Muehlenhoff) [11:21:45] (03PS2) 10Filippo Giunchedi: monitoring: add alerts for ats availability [puppet] - 10https://gerrit.wikimedia.org/r/550094 (https://phabricator.wikimedia.org/T236482) [11:23:28] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Code-Review-1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/550094 (https://phabricator.wikimedia.org/T236482) (owner: 10Filippo Giunchedi) [11:46:11] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 27965 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [11:46:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] puppet-export-facts: use the certificate provided by localcacert [puppet] - 10https://gerrit.wikimedia.org/r/549857 (https://phabricator.wikimedia.org/T214472) (owner: 10Jbond) [11:48:07] 10Operations, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: puppet: compiler-update-facts error and warning - https://phabricator.wikimedia.org/T214472 (10aborrero) Thanks for working on this! +1 to your patch. 
Anyway, I ran the script the other day and I found no issues. Feel free to close this ta... [11:50:57] RECOVERY - Disk space on elastic1018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [11:55:34] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [11:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:20] (03PS1) 10Ladsgroup: Revert "mediawiki: Make the rebuildItemTerms script slower" [puppet] - 10https://gerrit.wikimedia.org/r/550113 [12:19:55] (03CR) 10Mobrovac: "I really think the best way forward here is to have Flow declare a new variable that tells it whether to use VRS or wgFlowParsoidURL. That" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549875 (https://phabricator.wikimedia.org/T229078) (owner: 10Mobrovac) [12:21:21] !log Upgrade mw2* to 7.2.24-1 with elegance and restart php-fpm - T231881 [12:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:26] T231881: Conffile handling for PHP 7.2 packages - https://phabricator.wikimedia.org/T231881 [12:21:32] (03PS2) 10Mobrovac: [Beta] Flow: Use Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549875 (https://phabricator.wikimedia.org/T229078) [12:22:32] (03CR) 10Mobrovac: [Beta] Flow: Use Parsoid/PHP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549875 (https://phabricator.wikimedia.org/T229078) (owner: 10Mobrovac) [12:28:16] !log Upgrade mw2* to 7.2.24-1 with elegance and restart php-fpm - T237239 [12:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:20] T237239: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 [12:30:39] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@2cb2dde]: Deploy updates on wdqs1010 [12:30:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:07] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@2cb2dde]: Deploy updates on wdqs1010 (duration: 00m 28s) [12:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:27] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:42:25] (03CR) 10Alaa Sarhan: [C: 03+1] Revert "mediawiki: Make the rebuildItemTerms script slower" [puppet] - 10https://gerrit.wikimedia.org/r/550113 (owner: 10Ladsgroup) [12:46:52] !log Upgrade to 7.2.24-1 mwdebug[2001-2002].codfw.wmnet,mwmaint2001.codfw.wmnet,deploy2001.codfw.wmnet - T237239 [12:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:00] T237239: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239 [12:54:53] (03CR) 10Arturo Borrero Gonzalez: ceph: add ceph storage cluster profiles and modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: 10Jhedden) [12:59:49] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [12:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:03] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:16:36] (PS1) Cmjohnson: Adding elastic10[53-67].eqiad.wmnet to site.pp role::spare [puppet] - https://gerrit.wikimedia.org/r/550116 (https://phabricator.wikimedia.org/T230746) [13:18:11] (CR) Cmjohnson: [C: +2] Adding elastic10[53-67].eqiad.wmnet to site.pp role::spare [puppet] - https://gerrit.wikimedia.org/r/550116 (https://phabricator.wikimedia.org/T230746) (owner: Cmjohnson) [13:21:48] Operations, ops-eqiad, Discovery-Search (Current work), Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (Cmjohnson) [13:23:25] Operations, Discovery-Search (Current work), Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (Cmjohnson) a:Cmjohnson→Gehel @gelhel The new elasticsearch servers are installed and ready for you, I have assigned you... [13:24:29] (PS2) Ema: cache: reimage cp3050 as text_ats [puppet] - https://gerrit.wikimedia.org/r/550105 (https://phabricator.wikimedia.org/T227432) [13:25:40] !log depool cp3050 and reimage as text_ats T227432 [13:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:45] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [13:26:15] Operations, Discovery-Search (Current work), Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (Gehel) @Cmjohnson thanks! I'll have a look and let you know if there are any issues!
[13:26:19] (CR) Ema: [C: +2] cache: reimage cp3050 as text_ats [puppet] - https://gerrit.wikimedia.org/r/550105 (https://phabricator.wikimedia.org/T227432) (owner: Ema) [13:27:46] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3050.esams.wmnet'] ` The log can be found in `/var/log/wm... [13:33:31] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2001:9536,cp2004:9536,cp2006:9536,cp2007:9536,cp2010:9536,cp2013:9536,cp2016:9536,cp2019:9536,cp2023:9536} site=codfw tunnel={cp3050_v4,cp3050_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [13:34:11] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1079:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1087:9536} site=eqiad tunnel={cp3050_v4,cp3050_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [13:36:22] looks like those are expected [13:36:27] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2001:9536,cp2004:9536,cp2006:9536,cp2007:9536,cp2010:9536,cp2013:9536,cp2016:9536,cp2019:9536,cp2023:9536} site=codfw tunnel={cp3050_v4,cp3050_v6} Ema Reimaging cp3050 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [13:36:27] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1079:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1087:9536} site=eqiad tunnel={cp3050_v4,cp3050_v6} Ema Reimaging cp3050 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [13:36:37] yup!
[13:41:09] Operations, ops-eqiad: rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (Cmjohnson) [13:48:38] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [13:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:46] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:06] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [13:59:37] (PS9) Ammarpad: Rename DPL extension variable to non-ambiguous name, part 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/548569 (https://phabricator.wikimedia.org/T237698) [13:59:47] (PS3) Ammarpad: Rename DPL extension variable to non-ambiguous name, part 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/549666 (https://phabricator.wikimedia.org/T237698) [13:59:52] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:59:58] (PS3) Ammarpad: Rename DPL extension variable to non-ambiguous name, part 3 [mediawiki-config] - https://gerrit.wikimedia.org/r/549697 (https://phabricator.wikimedia.org/T237698) [14:00:22] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:02:58] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3050.esams.wmnet'] ` and were **ALL** successful. [14:03:34] (CR) Ema: [C: +2] cache_text esams: read ats-be etcd keys [puppet] - https://gerrit.wikimedia.org/r/550106 (https://phabricator.wikimedia.org/T227432) (owner: Ema) [14:03:41] (PS2) Ema: cache_text esams: read ats-be etcd keys [puppet] - https://gerrit.wikimedia.org/r/550106 (https://phabricator.wikimedia.org/T227432) [14:05:35] Operations, SRE-tools: Extend debmonitor with image tracking support - https://phabricator.wikimedia.org/T237978 (MoritzMuehlenhoff) [14:05:39] Operations, SRE-tools: Extend debmonitor with image tracking support - https://phabricator.wikimedia.org/T237978 (MoritzMuehlenhoff) p:Triage→Normal [14:06:43] ACKNOWLEDGEMENT - Maps - OSM synchronization lag - eqiad on icinga1001 is CRITICAL: 1.26e+06 ge 2.592e+05 Mathew.onipe https://phabricator.wikimedia.org/T237228 - The acknowledgement expires at: 2019-11-14 14:06:21. https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [14:07:28] ACKNOWLEDGEMENT - Maps - OSM synchronization lag - codfw on icinga1001 is CRITICAL: 1.26e+06 ge 2.592e+05 Mathew.onipe phabricator.wikimedia.org/T237228 - The acknowledgement expires at: 2019-11-14 14:07:06. https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [14:07:37] (CR) Marostegui: "Do you need me to merge this?"
[puppet] - https://gerrit.wikimedia.org/r/550113 (owner: Ladsgroup) [14:08:19] (CR) Ladsgroup: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/550113 (owner: Ladsgroup) [14:09:06] (PS2) Marostegui: Revert "mediawiki: Make the rebuildItemTerms script slower" [puppet] - https://gerrit.wikimedia.org/r/550113 (owner: Ladsgroup) [14:09:56] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:11:40] ACKNOWLEDGEMENT - DPKG on labtestpuppetmaster2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages Muehlenhoff T237982 https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:11:42] Operations: Broken package state on labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T237982 (MoritzMuehlenhoff) [14:11:57] (CR) Marostegui: [C: +2] Revert "mediawiki: Make the rebuildItemTerms script slower" [puppet] - https://gerrit.wikimedia.org/r/550113 (owner: Ladsgroup) [14:13:46] (PS1) Muehlenhoff: Extend debmonitor config with option to add links to images [puppet] - https://gerrit.wikimedia.org/r/550245 (https://phabricator.wikimedia.org/T237978) [14:18:57] Operations, ops-eqiad: rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (Cmjohnson) @Jclark-ctr I am not able to login to mgmt, can you verify that the IP, Gateway and subnet are correct.
[14:19:20] (PS1) Ema: ATS: use nvme disk for cp3050 ats-be cache [puppet] - https://gerrit.wikimedia.org/r/550246 (https://phabricator.wikimedia.org/T227432) [14:21:18] (PS2) Ema: ATS: use nvme disk for cp3050 ats-be cache [puppet] - https://gerrit.wikimedia.org/r/550246 (https://phabricator.wikimedia.org/T227432) [14:23:00] (CR) Ema: [C: +2] ATS: use nvme disk for cp3050 ats-be cache [puppet] - https://gerrit.wikimedia.org/r/550246 (https://phabricator.wikimedia.org/T227432) (owner: Ema) [14:24:36] PROBLEM - check_trafficserver_log_fifo_notpurge_backend on cp3050 is CRITICAL: CRITICAL: /var/log/trafficserver/notpurge.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:26:06] RECOVERY - check_trafficserver_log_fifo_notpurge_backend on cp3050 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /var/log/trafficserver/notpurge.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:26:41] !log pool cp3050 with ATS backend T227432 [14:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:47] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [14:54:18] RECOVERY - Check systemd state on labtestpuppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:42] RECOVERY - DPKG on labtestpuppetmaster2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:56:11] (PS19) Jhedden: ceph: add ceph storage cluster profiles and modules [puppet] - https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [14:58:56] (CR) Jhedden: ceph: add ceph storage cluster profiles and modules (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: Jhedden) [15:03:04] Operations: Broken package state on
labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T237982 (Andrew) Open→Resolved [15:11:44] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:13:43] Operations, wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (bd808) >>! In T237889#5652216, @MoritzMuehlenhoff wrote: > This task misses a rationale, what do we need it for on the non-labweb mw* servers? Anything which will be rolled out in the fu... [15:26:58] Operations, Analytics, Traffic: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 (elukey) [15:30:04] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker [15:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:23] Jumbo cluster --^ [15:33:24] (PS6) ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [15:35:47] (CR) jerkins-bot: [V: -1] store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) (owner: ArielGlenn) [15:42:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) [15:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:05] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers [15:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:27] Jumbo cluster --^ [15:56:42] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:16:28] (PS7) ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [16:19:11] (CR) jerkins-bot: [V: -1] store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) (owner: ArielGlenn) [16:19:32] (CR) Marostegui: "@cdanis, anything that needs changing from the dbconfig point of view?" [mediawiki-config] - https://gerrit.wikimedia.org/r/547596 (owner: Jforrester) [16:22:21] (CR) Jcrespo: "> Patch Set 4:" [mediawiki-config] - https://gerrit.wikimedia.org/r/547596 (owner: Jforrester) [16:23:19] (CR) Jcrespo: "> Patch Set 4:" [mediawiki-config] - https://gerrit.wikimedia.org/r/547596 (owner: Jforrester) [16:23:25] (PS8) ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [16:25:25] ACKNOWLEDGEMENT - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] Gehel replication is failing, so no new tile generation: https://phabricator.wikimedia.org/T237228 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [16:26:04] (CR) jerkins-bot: [V: -1] store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) (owner: ArielGlenn) [16:33:45] (PS9) ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [16:59:23] (CR) Ayounsi: "Discussed over IRC, one comment so far."
(1 comment) [software/netbox-reports] - https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: CRusnov) [17:07:54] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 24923 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [17:09:28] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:15:29] (PS2) Jhedden: install_server: add cloudcephosd partman config [puppet] - https://gerrit.wikimedia.org/r/549928 (https://phabricator.wikimedia.org/T228102) [17:15:31] (PS1) Jhedden: ceph: add spare::system role to ceph mon and osd [puppet] - https://gerrit.wikimedia.org/r/550339 (https://phabricator.wikimedia.org/T228102) [17:15:39] Operations, Puppet, observability: Icinga alert for hosts with no Puppet roles - https://phabricator.wikimedia.org/T238006 (ayounsi) p:Triage→Normal [17:16:37] (CR) Ayounsi: [C: +1] ceph: add spare::system role to ceph mon and osd [puppet] - https://gerrit.wikimedia.org/r/550339 (https://phabricator.wikimedia.org/T228102) (owner: Jhedden) [17:16:51] (CR) Jhedden: [C: +2] ceph: add spare::system role to ceph mon and osd [puppet] - https://gerrit.wikimedia.org/r/550339 (https://phabricator.wikimedia.org/T228102) (owner: Jhedden) [17:17:59] gehel: ^ is the disk space alert for you too? [17:18:56] (CR) Jhedden: [C: +2] install_server: add cloudcephosd partman config [puppet] - https://gerrit.wikimedia.org/r/549928 (https://phabricator.wikimedia.org/T228102) (owner: Jhedden) [17:20:17] XioNoX: thanks! Should resolve by itself in a bit. And we finally have the new elastic servers ready ! [17:20:25] cool! [17:20:35] Should be a lot smoother tomorrow once they are configured !
[17:20:48] RECOVERY - Disk space on elastic1018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [17:21:06] (PS10) ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [17:22:20] Operations, ops-eqiad, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephmon1001.... [17:23:44] (CR) jerkins-bot: [V: -1] store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) (owner: ArielGlenn) [17:32:06] (PS11) ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [17:41:55] !log jeh@cumin1001 START - Cookbook sre.hosts.downtime [17:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:20] (CR) Arturo Borrero Gonzalez: ceph: add ceph storage cluster profiles and modules (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: Jhedden) [17:44:02] !log jeh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:11] Operations, ops-eqiad, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephmon1001.wikimedia.org'] ` Of which
those **FAILED**: `... [17:53:51] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) [17:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:02] \o/ [17:59:12] nice! [18:01:29] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@222b1c2]: New WDQS build - 0.3.6-SNAPSHOT [18:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:44] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@222b1c2]: New WDQS build - 0.3.6-SNAPSHOT (duration: 15m 14s) [18:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:13] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@222b1c2]: New WDQS build - 0.3.6-SNAPSHOT [18:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:10] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@222b1c2]: New WDQS build - 0.3.6-SNAPSHOT (duration: 00m 57s) [18:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:46] (PS1) Arturo Borrero Gonzalez: toolforge: new k8s: specify default backend for nginx-ingress [puppet] - https://gerrit.wikimedia.org/r/550347 (https://phabricator.wikimedia.org/T234032) [18:38:07] Operations, Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (ArielGlenn) The above rsync completed; I will be rerunning it from time to time. In the meantime I have now moved onto the 'misc' dumps: rsync -av labstore1006.wikimedia.org::data/xmlda...
[18:50:19] (PS1) Urbanecm: [beta] Set wgGERestbaseUrl to false by default [mediawiki-config] - https://gerrit.wikimedia.org/r/550350 (https://phabricator.wikimedia.org/T238011) [18:52:41] (PS1) ArielGlenn: add dumpsdata1003 to peer hosts for rsync, part 2 [puppet] - https://gerrit.wikimedia.org/r/550352 [18:55:55] (CR) ArielGlenn: [C: +2] add dumpsdata1003 to peer hosts for rsync, part 2 [puppet] - https://gerrit.wikimedia.org/r/550352 (owner: ArielGlenn) [18:58:25] Operations, Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (ArielGlenn) I see that I did not bwlimit the labstore rsync, though in my earlier 20 attempts to get the rsync args right, I did have that in there. It will be limited for any catchup runs. [19:05:54] Operations, serviceops: mw1239 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T238018 (ayounsi) p:Triage→High [19:06:19] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 10 ge 4 Ayounsi https://phabricator.wikimedia.org/T238018 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops [19:07:05] (CR) CDanis: [C: -1] "As is I think this will break wikitech, as we need to also make this change reflected in etcd. Which will require some synchronization. " [mediawiki-config] - https://gerrit.wikimedia.org/r/547596 (owner: Jforrester) [19:17:40] Operations, Traffic, Wikidata, Wikidata-Query-Service, Patch-For-Review: LDF service does not Vary responses by Accept, sending incorrect cached responses to clients - https://phabricator.wikimedia.org/T232006 (Gehel) [19:25:38] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:24] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:27:50] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:35:57] chaomodus: ^ [19:36:17] !log disable ALGs on mr1-esams [19:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:52] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:38:28] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:38:30] (PS1) Ayounsi: Add security alg/forwarding-options/screen to mr template [homer/public] - https://gerrit.wikimedia.org/r/550356 [20:12:42] Operations, serviceops: mw1239 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T238018 (Joe) @ayounsi this server is being decommissioned in a few weeks, I don't think it should be fixed at all, we can just acknowledge the alert. [20:17:00] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [20:19:36] Operations, serviceops: mw1239 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T238018 (ayounsi) Open→Declined wfm.
[20:33:46] (PS12) ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [21:25:10] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:26:10] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:32:56] (PS1) Ayounsi: Add virtual-chassis support [software/homer] - https://gerrit.wikimedia.org/r/550367 [21:35:32] (CR) jerkins-bot: [V: -1] Add virtual-chassis support [software/homer] - https://gerrit.wikimedia.org/r/550367 (owner: Ayounsi) [21:36:26] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:36:48] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:40:40] (PS2) Ayounsi: Add virtual-chassis support [software/homer] - https://gerrit.wikimedia.org/r/550367 [21:43:18] (CR) jerkins-bot: [V: -1] Add virtual-chassis support [software/homer] - https://gerrit.wikimedia.org/r/550367 (owner: Ayounsi) [21:48:07] (CR) Ayounsi: "1/ I'll need help to write the tests." [software/homer] - https://gerrit.wikimedia.org/r/550367 (owner: Ayounsi) [21:49:59] (PS1) Ayounsi: Add virtual-chassis support [homer/public] - https://gerrit.wikimedia.org/r/550370 [21:54:23] PROBLEM - Host cp3065 is DOWN: PING CRITICAL - Packet loss = 100% [21:57:58] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [21:58:01] (CR) Ayounsi: "For some reasons codfw VCs have an explicit VC ID configured, eg. "virtual-chassis id 86e4.7680.72a3"." (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/550370 (owner: Ayounsi) [22:21:56] (PS13) ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [22:41:13] (PS1) Ayounsi: Add vlan support for asw [software/homer] - https://gerrit.wikimedia.org/r/550375 [22:42:26] (PS1) Ayounsi: Add vlan support for asw [homer/public] - https://gerrit.wikimedia.org/r/550376 [22:43:27] (CR) jerkins-bot: [V: -1] Add vlan support for asw [software/homer] - https://gerrit.wikimedia.org/r/550375 (owner: Ayounsi) [22:45:06] (CR) Ayounsi: Add vlan support for asw (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/550376 (owner: Ayounsi) [22:48:38] uh looks like cp3065 crashed [22:49:15] (PS2) Ayounsi: Add vlan support for asw [homer/public] - https://gerrit.wikimedia.org/r/550376 [22:49:48] !log power-cycle cp3065, currently down [22:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:35] (CR) Ayounsi: "Same comments as of parent change." [software/homer] - https://gerrit.wikimedia.org/r/550375 (owner: Ayounsi) [22:51:23] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3065.esams.wmnet [22:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:36] RECOVERY - Host cp3065 is UP: PING OK - Packet loss = 0%, RTA = 83.42 ms [22:55:44] I'm gonna leave it depooled for further checks tomorrow [22:59:46] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:11:00] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:40:56] (Abandoned) Paladox: gerrit: Remove symlink to mysql java connector [puppet] - https://gerrit.wikimedia.org/r/488099 (owner: Paladox) [23:41:10] (PS7) Paladox: Gerrit: Update soy templates for gerrit 2.16 [puppet] - https://gerrit.wikimedia.org/r/473264 [23:43:29] (PS7) Paladox: gerrit: Switch db from mysql to H2 [puppet] - https://gerrit.wikimedia.org/r/488093 (https://phabricator.wikimedia.org/T211139)