[00:00:06] PROBLEM - Maps - OSM synchronization lag - codfw on einsteinium is CRITICAL: 1.728e+05 ge 1.728e+05 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1
[00:00:07] PROBLEM - Maps - OSM synchronization lag - eqiad on einsteinium is CRITICAL: 1.728e+05 ge 1.728e+05 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1
[00:02:33] that looks like always the same pattern in the graph and that it will recover in a second
[00:04:31] always catches up at 2 days
[00:11:07] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 66 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[00:26:16] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[00:30:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:04:32] PROBLEM - configured eth on pc2007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:06:22] PROBLEM - dhclient process on pc2007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:07:06] new pc node that is me doing new install
[01:08:03] PROBLEM - puppet last run on pc2007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:08:03] PROBLEM - Check systemd state on pc2007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:09:53] PROBLEM - Check the NTP synchronisation status of timesyncd on pc2007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
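Editorial aside on the OSM synchronization lag alerts at the top of this log: the critical value `1.728e+05` is a number of seconds, which is why the comment at 00:04:31 speaks of "2 days". The `ge` in the check output reads as "greater than or equal". The sketch below only illustrates that arithmetic; the function name `lag_state` is hypothetical, and the warning value `9e+04` is taken from the recovery messages later in the log, not from the check's actual configuration.

```python
SECONDS_PER_DAY = 86_400

# Thresholds as printed in the alert and recovery messages (seconds).
CRITICAL_S = 1.728e5
WARNING_S = 9e4


def lag_state(lag_seconds, warning=WARNING_S, critical=CRITICAL_S):
    """Classify a lag value using 'greater than or equal' comparisons."""
    if lag_seconds >= critical:
        return "CRITICAL"
    if lag_seconds >= warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    print(CRITICAL_S / SECONDS_PER_DAY, "days")
    print(lag_state(1.728e5))
    print(lag_state(1.01e4))
```

Under these assumptions a lag that exactly reaches the critical value already alerts, which matches the `1.728e+05 ge 1.728e+05` lines above.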
[01:09:54] PROBLEM - puppet last run on pc2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[rsyslog-gnutls]
[01:11:43] PROBLEM - DPKG on pc2007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:15:32] PROBLEM - Disk space on pc2007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:20:12] PROBLEM - Host pc2007 is DOWN: PING CRITICAL - Packet loss = 100%
[01:21:13] RECOVERY - configured eth on pc2007 is OK: OK - interfaces up
[01:21:22] RECOVERY - Host pc2007 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms
[01:21:23] RECOVERY - DPKG on pc2007 is OK: All packages OK
[01:21:52] RECOVERY - Check systemd state on pc2007 is OK: OK - running: The system is fully operational
[01:22:02] RECOVERY - Disk space on pc2007 is OK: DISK OK
[01:22:12] RECOVERY - dhclient process on pc2007 is OK: PROCS OK: 0 processes with command name dhclient
[01:23:22] PROBLEM - puppet last run on pc2007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 44 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[rsyslog-gnutls]
[01:26:40] Operations, ops-codfw, DBA, Patch-For-Review, User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (Papaul) ``` pc2007 root@pc2007:~# fdisk -l Disk /dev/sda: 4.4 TiB, 4799217008640 bytes, 9373470720 sectors Units: sectors of 1 * 512 = 512 bytes Sector si...
[01:27:27] Operations, ops-codfw, DBA, Patch-For-Review, User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (Papaul)
[01:29:23] Operations, ops-codfw, DBA, Patch-For-Review, User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (Papaul) a:Papaul>Banyek @Banyek all yours
[01:34:53] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.774 second response time
[01:35:53] PROBLEM - puppet last run on pc2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 23 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[rsyslog-gnutls]
[01:38:02] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:39:53] RECOVERY - Check the NTP synchronisation status of timesyncd on pc2007 is OK: OK: synced at Wed 2018-10-31 01:39:51 UTC.
[01:46:53] PROBLEM - puppet last run on pc2010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 15 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[rsyslog-gnutls]
[02:32:52] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[02:48:21] RECOVERY - Maps - OSM synchronization lag - eqiad on einsteinium is OK: (C)1.728e+05 ge (W)9e+04 ge 1.01e+04 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1
[02:51:01] RECOVERY - Maps - OSM synchronization lag - codfw on einsteinium is OK: (C)1.728e+05 ge (W)9e+04 ge 1.025e+04 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1
[03:14:12] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[03:26:42] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 46 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[03:32:12] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 933.06 seconds
[04:08:02] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 135.07 seconds
[04:11:21] (CR) Andrew Bogott: [C: -2] "Going to try to do this within the cloud instead" [puppet] - https://gerrit.wikimedia.org/r/470445 (https://phabricator.wikimedia.org/T208244) (owner: Andrew Bogott)
[04:20:17] (PS2) Andrew Bogott: ntp: use cloud-specific ntp servers for cloud VMS [puppet] - https://gerrit.wikimedia.org/r/470446 (https://phabricator.wikimedia.org/T208244)
[04:20:19] (PS1) Andrew Bogott: Add role/profile for a set of in-cloud ntp servers [puppet] - https://gerrit.wikimedia.org/r/470751 (https://phabricator.wikimedia.org/T208244)
[04:21:13] (CR) jerkins-bot: [V: -1] Add role/profile for a set of in-cloud ntp servers [puppet] - https://gerrit.wikimedia.org/r/470751 (https://phabricator.wikimedia.org/T208244) (owner: Andrew Bogott)
[04:23:37] (PS2) Andrew Bogott: Add role/profile for a set of in-cloud ntp servers [puppet] - https://gerrit.wikimedia.org/r/470751 (https://phabricator.wikimedia.org/T208244)
[04:23:39] (PS3) Andrew Bogott: ntp: use cloud-specific ntp servers for cloud VMS [puppet] - https://gerrit.wikimedia.org/r/470446 (https://phabricator.wikimedia.org/T208244)
[04:27:42] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.956 second response time
[04:29:25] Operations, cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (Andrew) @Volans, I don't have timestamps, but I do have this from our weekly meeting alert summary: > > 2018-10-27: labvirt1014 puppet transient page > 2018-10-28: labvirt1017 puppet transien...
[04:31:11] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:29:25] (PS1) Elukey: mcrouter: switch codfw proxy mw2214 with mw2163 [puppet] - https://gerrit.wikimedia.org/r/470752 (https://phabricator.wikimedia.org/T208272)
[06:33:59] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13268/" [puppet] - https://gerrit.wikimedia.org/r/470752 (https://phabricator.wikimedia.org/T208272) (owner: Elukey)
[06:41:01] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[06:47:52] PROBLEM - Wikitech-static main page has content on labtestweb2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:48:12] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:48:22] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 68 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[06:48:52] RECOVERY - Wikitech-static main page has content on labtestweb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 33890 bytes in 0.237 second response time
[06:49:12] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 33890 bytes in 0.238 second response time
[06:50:56] Operations, User-Elukey: mcrouter prometheus exporter stops working when mcrouter restarts - https://phabricator.wikimedia.org/T208375 (elukey) p:Triage>Normal
[06:53:32] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[06:54:08] Operations, User-Elukey: Upgrade memkeys to its latest upstream - https://phabricator.wikimedia.org/T208376 (elukey) p:Triage>Normal
[06:58:41] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS1299/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:12:12] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 50 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:12:12] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 24, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:33:53] (CR) Filippo Giunchedi: [C: 2] remove graphite and carbon-relay cnames [dns] - https://gerrit.wikimedia.org/r/470626 (owner: Cwhite)
[07:33:58] (PS3) Filippo Giunchedi: remove graphite and carbon-relay cnames [dns] - https://gerrit.wikimedia.org/r/470626 (owner: Cwhite)
[07:37:05] (CR) Filippo Giunchedi: "LGTM! Naming nit inline but other than that looks good to go" (1 comment) [debs/statsd-proxy] (wmf_v0.0.10) - https://gerrit.wikimedia.org/r/470512 (https://phabricator.wikimedia.org/T196484) (owner: Cwhite)
[07:41:01] (PS1) Filippo Giunchedi: wmnet: reintroduce graphite.eqiad.wmnet only [dns] - https://gerrit.wikimedia.org/r/470755
[07:41:17] (CR) Filippo Giunchedi: [C: 1] wmnet: reintroduce graphite.eqiad.wmnet only [dns] - https://gerrit.wikimedia.org/r/470755 (owner: Filippo Giunchedi)
[07:41:41] (CR) Filippo Giunchedi: [C: 2] wmnet: reintroduce graphite.eqiad.wmnet only [dns] - https://gerrit.wikimedia.org/r/470755 (owner: Filippo Giunchedi)
[07:44:48] Operations, Continuous-Integration-Config, Jenkins: Ensure jenkins on puppet.git checks for yaml syntax errors - https://phabricator.wikimedia.org/T208240 (ema) p:Triage>Normal
[07:46:29] Operations, Patch-For-Review: stop using mod_php anywhere - https://phabricator.wikimedia.org/T208257 (ema) p:Triage>Normal
[07:48:16] Operations, LDAP-Access-Requests, SRE-Access-Requests: Requesting access to netbox for bd808 - https://phabricator.wikimedia.org/T208267 (ema) p:Triage>Normal
[07:49:24] Operations, Analytics, Analytics-EventLogging, Performance-Team, Traffic: Increase EventLogging limit from 2K to 5K - https://phabricator.wikimedia.org/T208282 (ema) p:Triage>Normal
[07:50:11] Operations, Release-Engineering-Team, monitoring, Performance-Team (Radar), goodfirstbug: Increase "check_legal_html" coverage to group0 wikis - https://phabricator.wikimedia.org/T208284 (ema) p:Triage>Normal
[07:50:55] Operations, Wikimedia-Site-requests, HHVM: Set hhvm.virtual_host[default][always_decode_post_data] = false - https://phabricator.wikimedia.org/T208191 (ema) p:Triage>Normal
[08:05:39] (CR) Filippo Giunchedi: "LGTM modulo one value" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/470659 (https://phabricator.wikimedia.org/T196484) (owner: Cwhite)
[08:07:54] (PS1) Elukey: Release latest upstream [debs/memkeys] (debian) - https://gerrit.wikimedia.org/r/470773 (https://phabricator.wikimedia.org/T208376)
[08:08:32] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[08:08:52] (CR) Elukey: "Already built and tested in deployment-prep, looks good so far." [debs/memkeys] (debian) - https://gerrit.wikimedia.org/r/470773 (https://phabricator.wikimedia.org/T208376) (owner: Elukey)
[08:11:22] (CR) Filippo Giunchedi: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/470452 (https://phabricator.wikimedia.org/T206454) (owner: Herron)
[08:13:39] (PS1) Elukey: geoip:archive.sh: avoid hardlinks [puppet] - https://gerrit.wikimedia.org/r/470778
[08:14:59] (CR) Elukey: "Andrew/Fran: I am probably missing something about the hardlinks, let me know if this change is not good.." [puppet] - https://gerrit.wikimedia.org/r/470778 (owner: Elukey)
[08:15:51] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 47 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[08:16:17] (CR) Filippo Giunchedi: logstash: add generic kafka input config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: Herron)
[08:18:02] RECOVERY - Check systemd state on ms-be1034 is OK: OK - running: The system is fully operational
[08:22:27] (PS1) Ema: admin: move sbassett to users [puppet] - https://gerrit.wikimedia.org/r/470779 (https://phabricator.wikimedia.org/T207852)
[08:25:59] Operations, SRE-Access-Requests, Patch-For-Review, User-jijiki: Requesting access to deployment and analytics-privatedata-users for sbassett - https://phabricator.wikimedia.org/T207852 (ema) Request approved during SRE meeting, 2018-10-29.
[08:26:44] (PS13) Banyek: mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253)
[08:27:31] (CR) jerkins-bot: [V: -1] mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: Banyek)
[08:32:23] (PS14) Banyek: mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253)
[08:33:10] (CR) jerkins-bot: [V: -1] mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: Banyek)
[08:35:45] (PS2) Ema: admin: move sbassett to users [puppet] - https://gerrit.wikimedia.org/r/470779 (https://phabricator.wikimedia.org/T207852)
[08:35:47] (PS1) Ema: admin: requested groups membership for sbassett [puppet] - https://gerrit.wikimedia.org/r/470783 (https://phabricator.wikimedia.org/T207852)
[08:36:21] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[08:39:46] !log start rolling out rsyslog 8.38 to stretch hosts
[08:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:16] (PS1) Ema: admin: add new user 'jdl' [puppet] - https://gerrit.wikimedia.org/r/470784 (https://phabricator.wikimedia.org/T207951)
[08:42:17] (PS1) Ema: admin: groups membership for jdl [puppet] - https://gerrit.wikimedia.org/r/470785 (https://phabricator.wikimedia.org/T207951)
[08:42:41] (PS15) Banyek: mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253)
[08:42:56] (CR) jerkins-bot: [V: -1] admin: add new user 'jdl' [puppet] - https://gerrit.wikimedia.org/r/470784 (https://phabricator.wikimedia.org/T207951) (owner: Ema)
[08:43:01] Operations, SRE-Access-Requests, Patch-For-Review, User-jijiki: Requesting access to deployment, operational logs, and analytics cluster for jlinehan - https://phabricator.wikimedia.org/T207951 (ema) Request approved during SRE meeting, 2018-10-29.
[08:43:39] (CR) jerkins-bot: [V: -1] mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: Banyek)
[08:43:51] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 55 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[08:44:42] RECOVERY - puppet last run on pc2008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:49:25] (PS16) Banyek: mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253)
[08:49:52] (PS1) Gehel: maps: increase alerting threshold on OSM replication lag [puppet] - https://gerrit.wikimedia.org/r/470787
[08:50:10] (CR) jerkins-bot: [V: -1] mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: Banyek)
[08:52:51] RECOVERY - puppet last run on pc2007 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:56:35] (PS17) Banyek: mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253)
[09:00:55] (CR) Ema: "recheck" [puppet] - https://gerrit.wikimedia.org/r/470784 (https://phabricator.wikimedia.org/T207951) (owner: Ema)
[09:04:21] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 30 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[09:04:41] RECOVERY - puppet last run on pc2009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:06:11] RECOVERY - puppet last run on pc2010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:06:44] (PS18) Banyek: mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253)
[09:11:42] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 55 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[09:12:28] (PS19) Banyek: mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253)
[09:12:55] (CR) Elukey: [C: 2] Release latest upstream [debs/memkeys] (debian) - https://gerrit.wikimedia.org/r/470773 (https://phabricator.wikimedia.org/T208376) (owner: Elukey)
[09:14:25] Thanks a lot hashar for the merge :)
[09:14:34] joal: does it work now? :)
[09:14:45] hashar: testing now - seems ok - will confirm
[09:15:09] Ah - actually no :(
[09:15:11] hashar:
[09:16:11] !log upload memkeys 20181031-1 to jessie-wikimedia thirdparty
[09:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:47] hashar: From what I read in the job, the parameter I added to the xconfig is not used :(
[09:18:35] (PS20) Banyek: mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253)
[09:23:32] PROBLEM - puppet last run on elastic2019 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[rsyslog],Package[rsyslog-gnutls]
[09:24:31] !log upgraded memkeys to 20181031-1 on all the mc* - T208376
[09:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:34] T208376: Upgrade memkeys to its latest upstream - https://phabricator.wikimedia.org/T208376
[09:24:52] Operations, Patch-For-Review, User-Elukey: Upgrade memkeys to its latest upstream - https://phabricator.wikimedia.org/T208376 (elukey) Open>Resolved
[09:24:54] Operations, Gadgets, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), and 4 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey)
[09:25:06] that elastic2019 failure is me
[09:25:31] PROBLEM - puppet last run on mw2184 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[rsyslog-gnutls]
[09:27:08] (PS19) Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352)
[09:28:30] godog: just curious, why is it failing only on elastic2019?
[09:29:32] gehel: by chance, not only on elastic2019, I forgot to add 'run-no-puppet' to my apt install invocation
[09:30:26] godog: `run-no-puppet` what is it?
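Editorial aside: the replies that follow in the log describe `run-no-puppet` as "a wrapper to disable puppet, do stuff, enable puppet" that also waits for a running puppet agent to finish. The sketch below is not the actual script; it is a minimal illustration of that pattern as a Python context manager. The name `puppet_disabled`, the injected `run` and `is_running` callables, and the order of operations (disable first, then wait) are assumptions made for the example. Only the `puppet agent --disable <message>` and `puppet agent --enable` commands are real CLI calls.

```python
import subprocess
import time
from contextlib import contextmanager


@contextmanager
def puppet_disabled(reason, run=subprocess.check_call,
                    is_running=lambda: False,
                    poll_seconds=1.0, sleep=time.sleep):
    """Disable the puppet agent, wait for any in-flight run, then re-enable.

    `run` executes a command list; `is_running` is a hypothetical probe that
    reports whether an agent run is still in progress (for example by
    checking the agent lock file).
    """
    # Disable first so that no new run can start while we wait.
    run(["puppet", "agent", "--disable", reason])
    try:
        # Let an already-running agent finish before touching packages.
        while is_running():
            sleep(poll_seconds)
        yield
    finally:
        # Always re-enable, even if the wrapped work raised.
        run(["puppet", "agent", "--enable"])


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    with puppet_disabled("apt install rsyslog", run=print):
        print("... package upgrade would happen here ...")
```

The `finally` clause is the point of the pattern: a failed package operation should not leave puppet disabled on the host.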
[09:30:31] * gehel is going to learn something today
[09:31:02] PROBLEM - DPKG on mwdebug2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:31:22] (CR) Gehel: "puppet compiler looks good: https://puppet-compiler.wmflabs.org/compiler1002/13274/" [puppet] - https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: Gehel)
[09:31:56] (PS2) Tim Eulitz: Prepare AdvancedSearch go-live SWAT changes [mediawiki-config] - https://gerrit.wikimedia.org/r/470642 (https://phabricator.wikimedia.org/T207638)
[09:33:05] gehel: basically to wait for puppet to finish if it is running before running a command
[09:34:20] (PS3) Tim Eulitz: Prepare AdvancedSearch go-live SWAT changes [mediawiki-config] - https://gerrit.wikimedia.org/r/470642 (https://phabricator.wikimedia.org/T207638)
[09:34:24] gehel: a wrapper to disable puppet, do stuff, enable puppet ;)
[09:34:31] Oh, so it is a wrapper script!
[09:34:35] the same we have in code in matt's review :D
[09:34:47] Nice, I did not know that one, I do that manually each time I need it
[09:34:52] that's why I want to move that to the puppet module once we'll have one :D
[09:35:00] makes sense
[09:35:34] hashar: Second try :)
[09:36:41] argh
[09:36:41] RECOVERY - DPKG on mwdebug2001 is OK: All packages OK
[09:38:29] joal: I have refreshed the job and build it manually ( https://integration.wikimedia.org/ci/job/analytics-refinery-release/145/console )
[09:38:51] RECOVERY - puppet last run on elastic2019 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:39:25] (PS21) Banyek: mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253)
[09:40:52] joal: no luck. I am looking at the web interface configuration page
[09:42:01] PROBLEM - puppet last run on mwdebug2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[rsyslog]
[09:43:08] Hey, I just added a request for a SWAT deployment for the mid-day SWAT in a bit. Since it's the first time I am doing a SWAT, could someone double-check if I missed anything / if there's a problem with it?
[09:43:14] https://wikitech.wikimedia.org/wiki/Deployments#Wednesday,_October_31 Link for the lazy
[09:43:29] joal: maybe the setting has to be set everywhere?
[09:43:36] (PS1) Vgutierrez: certcentral: Stop abusing SELF_SIGNED status to signal errors [software/certcentral] - https://gerrit.wikimedia.org/r/470790 (https://phabricator.wikimedia.org/T208378)
[09:45:56] joal: at least the build does: /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Djdk.net.URLClassPath.disableClassPathURLCheck=true
[09:46:05] but then it says: Executing Maven: -B -f /srv/jenkins-workspace/workspace/analytics-refinery-release/pom.xml -s /tmp/settings7804593339835235621.xml clean package
[09:46:08] which lacks the option
[09:46:16] so maybe it has to be set everywhere
[09:46:37] (PS22) Banyek: mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253)
[09:47:32] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[09:50:43] hashar: I experienced the same issue with the previous setting (without having the parameter set, so it seems better)
[09:51:06] hashar: meaning the Executing maven line didn't contain the parameter
[09:51:10] (PS23) Banyek: mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253)
[09:51:13] :(
[09:51:56] (CR) jerkins-bot: [V: -1] mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: Banyek)
[09:52:12] RECOVERY - puppet last run on mwdebug2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[09:54:11] (CR) Gabriel Birke: [C: 1] Prepare AdvancedSearch go-live SWAT changes [mediawiki-config] - https://gerrit.wikimedia.org/r/470642 (https://phabricator.wikimedia.org/T207638) (owner: Tim Eulitz)
[09:54:51] joal: :(((
[09:54:52] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 46 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[09:56:11] RECOVERY - puppet last run on mw2184 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[09:56:36] (CR) Ladsgroup: [C: 2] Prepare AdvancedSearch go-live SWAT changes [mediawiki-config] - https://gerrit.wikimedia.org/r/470642 (https://phabricator.wikimedia.org/T207638) (owner: Tim Eulitz)
[09:57:42] go go gadget Amir1
[09:57:46] (Merged) jenkins-bot: Prepare AdvancedSearch go-live SWAT changes [mediawiki-config] - https://gerrit.wikimedia.org/r/470642 (https://phabricator.wikimedia.org/T207638) (owner: Tim Eulitz)
[10:00:24] (PS24) Banyek: mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253)
[10:01:13] (CR) jerkins-bot: [V: -1] mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: Banyek)
[10:02:46] (CR) jenkins-bot: Prepare AdvancedSearch go-live SWAT changes [mediawiki-config] - https://gerrit.wikimedia.org/r/470642 (https://phabricator.wikimedia.org/T207638) (owner: Tim Eulitz)
[10:05:22] ^ rebased on deploy1001
[10:15:18] hashar: I have another solution to test that involves updating the jar - is there a way to easily setup the same java env that jenkins generates for me to test?
[10:18:18] Operations, DBA: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (jcrespo) p:Triage>High
[10:22:01] PROBLEM - DPKG on archiva1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:23:13] !log restarted pdfrender on scb1003
[10:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:21] RECOVERY - DPKG on archiva1001 is OK: All packages OK
[10:24:31] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[10:24:47] joal: I think the maven job is just a fancy way to run maven. I am not sure there is much magic
[10:25:03] joal: assuming you get the patched java 8 version, you should be able to reproduce the issue locally ?
[10:25:26] hashar: I am on that track currently (java version)
[10:25:32] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[10:25:49] hashar: as for maven, I really don't know how to pass it the param :(
[10:26:00] :\
[10:26:04] trying a release at https://integration.wikimedia.org/ci/job/analytics-refinery-release/147/console
[10:26:58] :(
[10:27:58] hashar: you're working on making releases from Jenkins? Great!
[10:29:24] Operations, DBA: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (Banyek) a:Banyek
[10:29:30] hashar: gone for now, will work on that again this afternoon and keep you posted - Thanks for the help :)
[10:29:41] gehel: na Madhumitha did it a while ago
[10:29:49] gehel: and solely for analytics/refinery
[10:30:11] hashar: I should have a look and see if we can do the same for our projects
[10:30:30] * gehel would feel much safer if releases were made by a robot and not by a human
[10:30:35] gehel: be bold! The logic for refinery is in jjb/analytics.yaml . Surely that could be generalized to all maven repos
[10:30:52] I'll get to it eventually...
[10:31:16] joal: or maybe that is due to the maven version
[10:32:50] hashar, joal: always use mvnwrapper and fix the version!
[10:33:01] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 59 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[10:33:31] hashar: btw, where is the Dockerfile for that java image that we use?
[10:34:23] Operations, DBA, User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (Banyek)
[10:34:26] Operations, DBA, User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (jcrespo) a:Banyek>None
[10:36:07] Operations, DBA, User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (jcrespo) a:Banyek
[10:36:39] Operations, ops-codfw, netops, Patch-For-Review: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (elukey) There is a problem in the schedule I am afraid.. Nov 1st is holiday for most of the Europeans, plus I am a bit concerned about DBA presence since @Banyek and and M...
[10:37:41] joal: AH so I have run the clean packages with mvn -X
[10:38:02] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[10:38:11] and eventually it says surefire:test fails, the command was:/bin/sh -c cd /srv/jenkins-workspace/workspace/analytics-refinery-release/refinery-core && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -jar /srv/jenkins-workspace/workspace/analytics-refinery-release/refinery-core/target/surefire/surefirebooter1119679690881654009.jar
[10:38:11] /srv/jenkins-workspace/workspace/analytics-refinery-release/refinery-core/target/surefire/surefire5743705044851205592tmp /srv/jenkins-workspace/workspace/analytics-refinery-release/refinery-core/target/surefire/surefire_07400669994662315744tmp
[10:38:39] sorry that is too long, the summary is the java command is not passed the magic setting
[10:39:51] PROBLEM - Nginx local proxy to apache on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:40:22] * hashar tries with maven 3.5 instead of 3.0
[10:40:42] RECOVERY - Nginx local proxy to apache on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.039 second response time
[10:43:25] Operations, ops-codfw, netops, Patch-For-Review: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (Joe) >>! In T208272#4706141, @ayounsi wrote: > Here is the full list of hosts in that row. No outages expected, but brief (5s) connectivity interruption for some racks is...
[10:45:07] Operations, ops-codfw, netops, Patch-For-Review: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (Joe) To be clear: I think we should do the maintenance **without depooling anything** and check what would happen when we lose a row, even if in an inactive datacenter. Bu...
[10:45:22] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 69 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[10:48:23] I will restart the CI Jenkins in a few minutes
[10:51:49] godog: hey, tell me when you're around! thanks
[10:53:03] Amir1: hey, sure I'm here
[10:53:34] godog: so, logstash can't find ores logs or it's discarding them
[10:53:47] what can we do? How can we check?
[10:54:32] https://logstash.wikimedia.org/goto/617aa6c0cedc953b704a2ed722c7078e
[10:54:47] the INFO ones are coming from uwsgi and ores is not sending them
[10:55:10] Amir1: ack, is there a task ?
[10:55:17] there is five [10:55:26] I'll check ores1001 [10:56:06] !log restarting CI jenkins on contint1001 [10:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:15] Amir1: I looked now and it doesn't seem ores1001 is sending logstash.svc anything on port 12201 [10:57:16] godog: this would work for now: https://phabricator.wikimedia.org/T181630 [10:57:45] let me check the config [10:57:46] !log contint1001: upgraded java and restarted Jenkins [10:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:18] The config says: logstash.svc.eqiad.wmnet:12201 (is this right?) [10:59:21] depends what protocol/format you are using for sending, I'm assuming this is python-logstash ? [10:59:26] (03PS2) 10Elukey: Add change_tag to list of tables to sqoop [puppet] - 10https://gerrit.wikimedia.org/r/470593 (https://phabricator.wikimedia.org/T205940) (owner: 10Fdans) [10:59:35] I am also around Amir1 [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181031T1100). [11:00:04] Dereckson, Amir1, and Tim_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:09] godog: yup. It sends the JSON stuff [11:00:29] tim_WMDE: cool. Your patch should be live on beta by now (hopefully) I can double check [11:00:44] o/ [11:01:11] Dereckson, Amir1: you are deployers, right? 
go ahead and self-organize and deploy your patches :) [11:01:24] 10Operations, 10Continuous-Integration-Config: Ensure jenkins on puppet.git checks for yaml syntax errors - https://phabricator.wikimedia.org/T208240 (10hashar) [11:01:26] tim_WMDE: are you a deployer? or do you need help deploying the patch? [11:01:32] Amir1 looks like it is live, thanks [11:01:46] I am just around because I submitted something for SWAT deployment [11:02:04] zeljkof: I merged tim' [11:02:11] *Tim's patch already (beta) [11:02:18] Amir1: ah, cool, ok, then go ahead, I'm around if you need me :) [11:02:27] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470679 (https://phabricator.wikimedia.org/T205064) (owner: 10Ladsgroup) [11:03:30] Hello. Give me a greenlight and I can deploy it. It's a no op in prod. [11:03:38] Amir1: ok if that's "one json per line" then the port is 11514 [11:03:42] (03Merged) 10jenkins-bot: Do not load WikibaseQuality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470679 (https://phabricator.wikimedia.org/T205064) (owner: 10Ladsgroup) [11:03:45] 12201 is udp/gelf [11:04:14] also I'm assuming this is equally broken in beta [11:04:23] I think we are sending it over udp but json per line (let me double check) [11:05:27] testing the patch in mwdebug1002 [11:08:55] !log ladsgroup@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:470679|Do not load WikibaseQuality (T205064)]] (duration: 01m 05s) [11:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:59] T205064: Undeploy WikibaseQuality extension from the WMF - https://phabricator.wikimedia.org/T205064 [11:10:50] (03CR) 10jenkins-bot: Do not load WikibaseQuality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470679 (https://phabricator.wikimedia.org/T205064) (owner: 10Ladsgroup) [11:11:56] Amir1: I'd recommend starting testing in beta and move to port 11514/udp, that should work and accept json over udp 
[11:12:23] 12201/udp is gelf which isn't implemented by logstash_handler.py afaics [11:21:14] oh okay [11:21:31] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10fgiunchedi) >>! In T208272#4708611, @Joe wrote: > we will most likely need to run switftrepl after the outage to catch up on missing originals. Should we failover traffic... [11:22:02] Dereckson: SWAT is yours [11:26:30] 10Operations, 10Continuous-Integration-Config: Ensure jenkins on puppet.git checks for yaml syntax errors - https://phabricator.wikimedia.org/T208240 (10hashar) PuppetSyntax has support to lint hiera files and the task should be run when a hiera file is changed. With that change, the rake task is registered (`... [11:28:29] thanks [11:30:19] jouncebot: next [11:30:20] In 0 hour(s) and 29 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181031T1200) [11:30:24] heh [11:30:38] bblack: we're currently in SWAT, do you need to add a change? [11:31:07] no, just keeping myself aware of concurrent things! 
:) [11:32:18] (03PS2) 10Dereckson: Find bash in environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470585 [11:33:33] (03CR) 10Dereckson: [C: 032] Find bash in environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470585 (owner: 10Dereckson) [11:34:48] (03Merged) 10jenkins-bot: Find bash in environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470585 (owner: 10Dereckson) [11:36:28] !log dereckson@deploy1001 Synchronized docroot/noc/createTxtFileSymlinks.sh: UNIX-agnostic shebang for createTxtFileSymlinks ([[Gerrit:470585]], no-op in prod) (duration: 00m 54s) [11:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:44] (03PS9) 10Giuseppe Lavagetto: mediawiki: add httpd class, alternative to mediawiki::web [puppet] - 10https://gerrit.wikimedia.org/r/467643 [11:37:46] (03PS16) 10Giuseppe Lavagetto: mediawiki::webserver: introduce profile, use it on mwdebug* [puppet] - 10https://gerrit.wikimedia.org/r/467644 [11:38:45] (03CR) 10jenkins-bot: Find bash in environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470585 (owner: 10Dereckson) [11:40:36] (03CR) 10Filippo Giunchedi: create rsyslog::ship_logfile - simplified logstash shipper via kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469945 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [11:41:06] godog: hmm, with 11514 in beta still can't see anything :/ [11:41:53] Amir1: what's the beta host you're trying on? [11:42:20] deployment-ores01 sending to deployment-logstash2.deployment-prep.eqiad.wmflabs:11514 [11:43:52] PROBLEM - High lag on wdqs1004 is CRITICAL: 3633 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:45:20] Amir1: are you root on that host? 
I'm checking if anything is sent with tcpdump -i any 'port 11514' [11:45:24] and doesn't look like it [11:46:09] let me double check [11:46:12] PROBLEM - High lag on wdqs1004 is CRITICAL: 3662 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:46:25] we need to restart the service to make it emit some logs [11:47:39] godog: it sends them when you restart the service: 11:47:10.950660 IP deployment-ores01.deployment-prep.eqiad.wmflabs.52533 > deployment-logstash2.deployment-prep.eqiad.wmflabs.11514: UDP, length 1063 [11:47:46] indeed now it is sending [11:49:53] godog: haha, they now show up in logstash but everything is INFO :/ [11:50:11] godog: Is this wrong? https://github.com/wikimedia/ores/blob/master/ores/logging/logstash_fomatter.py#L33 [11:50:39] maybe we send WARNING and it needs to get a number or something similar [11:50:42] PROBLEM - High lag on wdqs1005 is CRITICAL: 3662 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:50:58] Amir1: I don't know, why not use python-logstash though? [11:51:28] godog: it doesn't support python3 properly and also it's unmaintained (the whole thing is a basic copy paste though) [11:53:09] I see [11:53:53] something did get sent as warning though, e.g. 
[11:53:55] {"@version": "1", "host": "deployment-ores01", "message": "celery@deployment-ores01 ready.", "@timestamp": "2018-10-31T11:47:53.928143+00:00", "tags": [], "level": "WARNING", "path": [11:55:55] hmm, let me try something [11:56:51] (03PS5) 10BBlack: interface::rps: strict single CPU core per queue [puppet] - 10https://gerrit.wikimedia.org/r/468313 [11:56:53] (03PS7) 10BBlack: interface::rps: always be NUMA aware [puppet] - 10https://gerrit.wikimedia.org/r/467469 [11:56:55] (03PS7) 10BBlack: graphite: add interface::rps settings to graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/468388 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) [11:56:57] (03PS1) 10BBlack: remove numa device_to_cpumask_invert fact [puppet] - 10https://gerrit.wikimedia.org/r/470812 [11:56:59] (03PS1) 10BBlack: remove wdqs numa_networking hieradata [puppet] - 10https://gerrit.wikimedia.org/r/470813 [11:57:01] (03PS1) 10BBlack: tlsproxy: always NUMA, and looser CPU binding [puppet] - 10https://gerrit.wikimedia.org/r/470814 [11:57:03] (03PS1) 10BBlack: remove global numa_networking [puppet] - 10https://gerrit.wikimedia.org/r/470815 [11:57:09] ok, I have to go Amir1, ttyl [11:57:58] have fun! 
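Note on the ORES/logstash thread above (10:53 to 11:57): the exchange settles on sending one JSON document per UDP datagram to port 11514 rather than GELF on 12201, with the severity carried as a level name such as "WARNING". A minimal Python sketch of that idea follows. The class name JsonLinesUdpHandler is hypothetical and this is not the code in ores/logging; the field names are copied from the sample message pasted at 11:53:55, and whether the logstash input needs exactly these fields is an assumption.

```python
import json
import logging
import socket
from datetime import datetime, timezone


class JsonLinesUdpHandler(logging.Handler):
    """Hypothetical handler: sends one JSON document per UDP datagram."""

    def __init__(self, host, port):
        super().__init__()
        self.address = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def format_record(self, record):
        # Field names mirror the sample message in the log; the exact
        # set expected by the receiving logstash input is an assumption.
        doc = {
            "@version": "1",
            "@timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "host": socket.gethostname(),
            "message": record.getMessage(),
            "level": record.levelname,  # level name, e.g. "WARNING"
            "tags": [],
        }
        # One JSON object per line, newline-terminated.
        return json.dumps(doc) + "\n"

    def emit(self, record):
        try:
            payload = self.format_record(record).encode("utf-8")
            self.sock.sendto(payload, self.address)
        except OSError:
            self.handleError(record)

    def close(self):
        self.sock.close()
        super().close()
```

To try it against the beta host named at 11:42:20, one would attach JsonLinesUdpHandler("deployment-logstash2.deployment-prep.eqiad.wmflabs", 11514) to a logger and watch for datagrams with the tcpdump filter quoted at 11:45:20. UDP gives no delivery confirmation, so a packet capture is the quickest way to see whether anything leaves the host.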
[11:58:29] (03PS17) 10Giuseppe Lavagetto: mediawiki::webserver: introduce profile, use it on mwdebug* [puppet] - 10https://gerrit.wikimedia.org/r/467644 [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181031T1200) [12:08:55] 10Operations, 10Certcentral, 10DNS, 10Traffic: Allow Let's Encrypt issue wildcard certificates - https://phabricator.wikimedia.org/T208390 (10Vgutierrez) [12:09:40] 10Operations, 10Certcentral, 10DNS, 10Traffic: Allow Let's Encrypt issue wildcard certificates - https://phabricator.wikimedia.org/T208390 (10Vgutierrez) p:05Triage>03Normal [12:17:42] RECOVERY - High lag on wdqs1005 is OK: (C)3600 ge (W)1200 ge 273 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:21:39] (03PS1) 10BBlack: wikimedia.org CAA: allow wildcards for LE [dns] - 10https://gerrit.wikimedia.org/r/470816 (https://phabricator.wikimedia.org/T208390) [12:26:51] PROBLEM - High lag on wdqs1005 is CRITICAL: 4413 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:27:01] 10Operations, 10Wikimedia-Logstash, 10monitoring, 10Patch-For-Review, and 3 others: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630 (10Ladsgroup) It needs some changes. I will make them and when you do them it just works: {F26999501} cc @fgiunchedi [12:29:46] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "What if we use the ensure_service() function to avoid the big if {} block?" 
[puppet] - 10https://gerrit.wikimedia.org/r/470683 (https://phabricator.wikimedia.org/T207591) (owner: 10GTirloni) [12:30:38] (03PS18) 10Giuseppe Lavagetto: mediawiki::webserver: introduce profile, use it on mwdebug* [puppet] - 10https://gerrit.wikimedia.org/r/467644 [12:31:26] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Refactor puppet WDQS module - https://phabricator.wikimedia.org/T208201 (10Mathew.onipe) [12:32:20] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): refactor wdqs::updater to use scap::targets for sudo rules - https://phabricator.wikimedia.org/T208392 (10Mathew.onipe) p:05Triage>03Normal [12:33:13] (03PS2) 10Arturo Borrero Gonzalez: deployment-prep hieradata: Fix comment about which host this IP is [puppet] - 10https://gerrit.wikimedia.org/r/470095 (owner: 10Alex Monk) [12:34:15] (03CR) 10Arturo Borrero Gonzalez: [C: 032] deployment-prep hieradata: Fix comment about which host this IP is [puppet] - 10https://gerrit.wikimedia.org/r/470095 (owner: 10Alex Monk) [12:35:29] !log depooling wdqs1005 to catch up with others [12:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:02] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Fix Type constraints in wdqs (init.pp) - https://phabricator.wikimedia.org/T208393 (10Mathew.onipe) p:05Triage>03Normal [12:40:19] (03CR) 10Giuseppe Lavagetto: [C: 032] "Cherry-picked on beta, it works as expected. No significant changes happen. Apache gets restarted since we change the mode of the worker.l" [puppet] - 10https://gerrit.wikimedia.org/r/467644 (owner: 10Giuseppe Lavagetto) [12:40:39] (03CR) 10Giuseppe Lavagetto: [C: 032] "works correctly in beta." 
[puppet] - 10https://gerrit.wikimedia.org/r/467643 (owner: 10Giuseppe Lavagetto) [12:40:51] (03PS10) 10Giuseppe Lavagetto: mediawiki: add httpd class, alternative to mediawiki::web [puppet] - 10https://gerrit.wikimedia.org/r/467643 [12:42:13] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Isolate wdqs service (blazegraph) as a submodule under the wdqs module - https://phabricator.wikimedia.org/T208394 (10Mathew.onipe) p:05Triage>03Normal [12:42:20] (03PS19) 10Giuseppe Lavagetto: mediawiki::webserver: introduce profile, use it on mwdebug* [puppet] - 10https://gerrit.wikimedia.org/r/467644 [12:44:10] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Cleanup wdqs puppet profile to include the new changes based on refactoring - https://phabricator.wikimedia.org/T208395 (10Mathew.onipe) p:05Triage>03Normal [12:44:53] (03PS1) 10Gehel: wdqs: raise alerting threshold on updater lag for public cluster [puppet] - 10https://gerrit.wikimedia.org/r/470819 (https://phabricator.wikimedia.org/T199228) [12:50:40] (03PS3) 10GTirloni: tools-services: Add updatetools_enabled key [puppet] - 10https://gerrit.wikimedia.org/r/470683 (https://phabricator.wikimedia.org/T207591) [12:51:35] (03CR) 10jerkins-bot: [V: 04-1] tools-services: Add updatetools_enabled key [puppet] - 10https://gerrit.wikimedia.org/r/470683 (https://phabricator.wikimedia.org/T207591) (owner: 10GTirloni) [12:57:00] (03PS4) 10GTirloni: tools-services: Add updatetools_enabled key [puppet] - 10https://gerrit.wikimedia.org/r/470683 (https://phabricator.wikimedia.org/T207591) [12:57:56] (03CR) 10DCausse: [C: 031] relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [12:58:00] (03CR) 10Ottomata: [C: 031] "IIRC, the hardlinks were just to avoid duplicates and save a bit of space, this should be fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/470778 (owner: 10Elukey) [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181031T1300) [13:02:53] hmm. this rename seems to have lost it's original: https://commons.wikimedia.org/wiki/File:Lekeitioko_Mertxe_Pagoaga.webm [13:05:35] (03PS2) 10Elukey: geoip:archive.sh: avoid hardlinks [puppet] - 10https://gerrit.wikimedia.org/r/470778 [13:06:24] (03CR) 10Elukey: [C: 032] geoip:archive.sh: avoid hardlinks [puppet] - 10https://gerrit.wikimedia.org/r/470778 (owner: 10Elukey) [13:06:46] (03CR) 10Fdans: [C: 031] geoip:archive.sh: avoid hardlinks [puppet] - 10https://gerrit.wikimedia.org/r/470778 (owner: 10Elukey) [13:09:32] (03PS1) 10Ladsgroup: ores: Change logstash port from GELF to json lines [puppet] - 10https://gerrit.wikimedia.org/r/470827 (https://phabricator.wikimedia.org/T181546) [13:10:58] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10Eevans) >>! In T208272#4708611, @Joe wrote: >>>! In T208272#4706141, @ayounsi wrote: >> >> [ ... ] >> >> restbase2003 >> restbase2004 >> restbase2008 >> restbase2011 > >... [13:13:40] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "I think we need the logic for the ensure inside service_params." [puppet] - 10https://gerrit.wikimedia.org/r/470683 (https://phabricator.wikimedia.org/T207591) (owner: 10GTirloni) [13:17:04] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), and 4 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Finally after a lot of digging I added a meaningful graph to... 
[13:18:55] (03PS5) 10GTirloni: tools-services: Add updatetools_enabled key [puppet] - 10https://gerrit.wikimedia.org/r/470683 (https://phabricator.wikimedia.org/T207591) [13:19:50] (03CR) 10Hashar: [C: 031] "Verified. Should be good to go." [puppet] - 10https://gerrit.wikimedia.org/r/376739 (https://phabricator.wikimedia.org/T93414) (owner: 10Hashar) [13:22:20] (03PS2) 10Hashar: TXT entries for Github domain verification [dns] - 10https://gerrit.wikimedia.org/r/468279 (https://phabricator.wikimedia.org/T207364) [13:23:20] (03CR) 10Hashar: "They will then be verified on https://github.com/organizations/wikimedia/settings/domains" [dns] - 10https://gerrit.wikimedia.org/r/468279 (https://phabricator.wikimedia.org/T207364) (owner: 10Hashar) [13:25:43] (03PS1) 10Filippo Giunchedi: swift: enable statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/470830 (https://phabricator.wikimedia.org/T205870) [13:26:26] (03CR) 10Filippo Giunchedi: [C: 032] swift: enable statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/470830 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [13:26:36] (03PS2) 10Filippo Giunchedi: swift: enable statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/470830 (https://phabricator.wikimedia.org/T205870) [13:31:52] PROBLEM - puppet last run on ms-be2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:31:52] PROBLEM - puppet last run on ms-be1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:32:17] sigh that's me [13:32:32] PROBLEM - puppet last run on ms-fe2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:33:11] PROBLEM - puppet last run on ms-be1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:33:21] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:34:02] PROBLEM - puppet last run on ms-fe2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:34:11] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:34:11] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:34:12] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:34:12] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:34:12] PROBLEM - puppet last run on ms-be2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:34:22] PROBLEM - puppet last run on ms-be2026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:34:31] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:34:32] PROBLEM - puppet last run on ms-be2018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:34:32] PROBLEM - puppet last run on ms-be2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:34:39] (03CR) 10Effie Mouzeli: [C: 031] admin: move sbassett to users [puppet] - 10https://gerrit.wikimedia.org/r/470779 (https://phabricator.wikimedia.org/T207852) (owner: 10Ema) [13:34:52] PROBLEM - puppet last run on ms-be1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:34:52] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:34:52] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:34:52] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:35:02] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:35:02] PROBLEM - puppet last run on ms-be1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:35:11] PROBLEM - puppet last run on ms-be2014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:35:22] PROBLEM - puppet last run on ms-fe1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:35:22] PROBLEM - puppet last run on ms-be1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:35:22] PROBLEM - puppet last run on ms-be2033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:35:23] godog: maybe quiet icinga-wm ? [13:35:31] PROBLEM - puppet last run on ms-be2038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:35:31] PROBLEM - puppet last run on ms-be2037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:35:32] PROBLEM - puppet last run on ms-fe1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:35:38] (03PS1) 10Filippo Giunchedi: hieradata: add mappings for swift/statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/470832 [13:35:42] PROBLEM - puppet last run on ms-be2042 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:35:50] Platonides: fix incoming [13:35:52] PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:35:52] PROBLEM - puppet last run on ms-fe1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:36:00] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: add mappings for swift/statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/470832 (owner: 10Filippo Giunchedi) [13:36:12] PROBLEM - puppet last run on ms-be2022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:36:22] PROBLEM - puppet last run on ms-be2041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:36:31] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:36:31] PROBLEM - puppet last run on ms-be2027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:36:31] PROBLEM - puppet last run on ms-be1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:36:32] PROBLEM - puppet last run on ms-be2029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:36:35] good [13:36:42] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:36:51] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:36:51] PROBLEM - puppet last run on ms-be2020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:36:51] PROBLEM - puppet last run on ms-be2028 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:36:51] PROBLEM - puppet last run on ms-be2039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:11] PROBLEM - puppet last run on ms-fe2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:32] PROBLEM - puppet last run on ms-be2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:33] PROBLEM - puppet last run on ms-be1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:41] PROBLEM - puppet last run on ms-be1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:41] PROBLEM - puppet last run on ms-be2031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:42] PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:51] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:51] PROBLEM - puppet last run on ms-fe2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:52] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:52] PROBLEM - puppet last run on ms-be1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:52] PROBLEM - puppet last run on ms-be2036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:02] PROBLEM - puppet last run on ms-be2024 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:38:11] PROBLEM - puppet last run on ms-be2035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:22] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:31] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:32] PROBLEM - puppet last run on ms-be1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:40] oof [13:38:41] PROBLEM - puppet last run on ms-fe1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:42] PROBLEM - puppet last run on ms-be2043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:43] (03CR) 10Effie Mouzeli: [C: 031] admin: add new user 'jdl' [puppet] - 10https://gerrit.wikimedia.org/r/470784 (https://phabricator.wikimedia.org/T207951) (owner: 10Ema) [13:38:51] PROBLEM - puppet last run on ms-be2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:38:59] I’ll turn down ircecho for the time being [13:39:11] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:39:11] PROBLEM - puppet last run on ms-be1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:39:12] PROBLEM - puppet last run on ms-be2040 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[13:39:25] ok
[13:39:41] (PS2) Filippo Giunchedi: hieradata: add mappings for swift/statsd_exporter [puppet] - https://gerrit.wikimedia.org/r/470832
[13:39:45] !log temporarily stopping ircecho on einsteinium
[13:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:05] (CR) Filippo Giunchedi: [C: 2] hieradata: add mappings for swift/statsd_exporter [puppet] - https://gerrit.wikimedia.org/r/470832 (owner: Filippo Giunchedi)
[13:41:36] (PS3) Filippo Giunchedi: hieradata: add mappings for swift/statsd_exporter [puppet] - https://gerrit.wikimedia.org/r/470832
[13:42:55] herron: thanks
[13:43:39] godog: can I be of any help?
[13:44:44] volans: it should be recovering now but thanks anyways
[13:44:59] sorry, saw it just now
[13:46:00] nah my bad for not running the compiler, too eager
[13:51:05] (PS25) Banyek: mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253)
[13:52:05] (CR) jerkins-bot: [V: -1] mariadb: table checker for monitoring data drift [puppet] - https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: Banyek)
[13:53:01] (PS1) Elukey: Move statistics::discovery to stat1007 [puppet] - https://gerrit.wikimedia.org/r/470837 (https://phabricator.wikimedia.org/T205846)
[13:54:01] (CR) Elukey: [C: 2] Move statistics::discovery to stat1007 [puppet] - https://gerrit.wikimedia.org/r/470837 (https://phabricator.wikimedia.org/T205846) (owner: Elukey)
[13:57:42] (PS1) Filippo Giunchedi: hieradata: fix statsd_exporter mappings for swift [puppet] - https://gerrit.wikimedia.org/r/470838
[13:58:16] (CR) Filippo Giunchedi: [C: 2] hieradata: fix statsd_exporter mappings for swift [puppet] - https://gerrit.wikimedia.org/r/470838 (owner: Filippo Giunchedi)
[13:58:29] (PS2) Filippo Giunchedi: hieradata: fix statsd_exporter mappings for swift [puppet] - https://gerrit.wikimedia.org/r/470838
[14:08:46] (CR) Filippo Giunchedi: [C: 1] admin: requested groups membership for sbassett [puppet] - https://gerrit.wikimedia.org/r/470783 (https://phabricator.wikimedia.org/T207852) (owner: Ema)
[14:08:54] Operations, Release-Engineering-Team (Kanban): Migrate operations/puppet CI job from Jessie to Stretch - https://phabricator.wikimedia.org/T208422 (hashar)
[14:10:15] (CR) Filippo Giunchedi: [C: 1] admin: move sbassett to users [puppet] - https://gerrit.wikimedia.org/r/470779 (https://phabricator.wikimedia.org/T207852) (owner: Ema)
[14:11:07] Operations, Patch-For-Review, Release-Engineering-Team (Kanban): Migrate operations/puppet CI job from Jessie to Stretch - https://phabricator.wikimedia.org/T208422 (hashar) a:hashar
[14:11:21] !log re-enabling ircecho on einsteinium
[14:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:00] (PS1) Elukey: Move ::statistics::wmde to stat1007 [puppet] - https://gerrit.wikimedia.org/r/470840 (https://phabricator.wikimedia.org/T205846)
[14:13:13] (CR) Filippo Giunchedi: [C: 1] admin: add new user 'jdl' [puppet] - https://gerrit.wikimedia.org/r/470784 (https://phabricator.wikimedia.org/T207951) (owner: Ema)
[14:14:09] (CR) Elukey: [C: 2] Move ::statistics::wmde to stat1007 [puppet] - https://gerrit.wikimedia.org/r/470840 (https://phabricator.wikimedia.org/T205846) (owner: Elukey)
[14:17:31] !log Adding a Stretch based CI job for operations/puppet (non voting job for now) | T208422
[14:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:35] T208422: Migrate operations/puppet CI job from Jessie to Stretch - https://phabricator.wikimedia.org/T208422
[14:17:37] Operations, Patch-For-Review, Release-Engineering-Team (Kanban): Migrate operations/puppet CI job from Jessie to Stretch - https://phabricator.wikimedia.org/T208422 (hashar) Deployed, the new job uses Stretch and is non voting until it is proven to be working properly :]
[14:17:44] Testing dologmsg on mwmaint1002
[14:18:11] !log anomie Testing dologmsg on mwmaint1002
[14:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:27] (PS3) Ema: admin: move sbassett to users [puppet] - https://gerrit.wikimedia.org/r/470779 (https://phabricator.wikimedia.org/T207852)
[14:18:36] (Abandoned) Hashar: git buildpackage configuration [debs/wmf-pt-kill] - https://gerrit.wikimedia.org/r/461941 (owner: Hashar)
[14:19:22] (CR) Ema: [C: 2] admin: move sbassett to users [puppet] - https://gerrit.wikimedia.org/r/470779 (https://phabricator.wikimedia.org/T207852) (owner: Ema)
[14:20:06] (Abandoned) Hashar: Bump Jinja2 to 2.10+ [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/399155 (owner: Hashar)
[14:21:59] (CR) Mathew.onipe: "Just one comment. But all looks good!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/470819 (https://phabricator.wikimedia.org/T199228) (owner: Gehel)
[14:22:02] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/srv/analytics-wmde/graphite],File[/srv/analytics-wmde/wdcm]
[14:22:41] !log anomie@mwmaint1002 Running migrateImageCommentTemp.php on group0 for T188132
[14:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:45] T188132: Merge image_comment_temp table into the image table - https://phabricator.wikimedia.org/T188132
[14:24:06] !log anomie@mwmaint1002 Running migrateComments.php on group0 for T166733
[14:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:10] T166733: Deploy refactored comment storage - https://phabricator.wikimedia.org/T166733
[14:25:02] (Abandoned) Hashar: prometheus: make ferm DNS record type configurable [puppet] - https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T153468) (owner: Hashar)
[14:25:20] (CR) Mathew.onipe: [C: 1] maps: increase alerting threshold on OSM replication lag [puppet] - https://gerrit.wikimedia.org/r/470787 (owner: Gehel)
[14:29:22] (CR) Faidon Liambotis: "This is an improvement, so not voting it down, but it also seems like a good candidate for moving this to Hiera." [puppet] - https://gerrit.wikimedia.org/r/470446 (https://phabricator.wikimedia.org/T208244) (owner: Andrew Bogott)
[14:31:26] Operations, ops-codfw, DBA, Patch-For-Review, User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (Banyek) a:Banyek>Papaul @Papaul as I checked the storage on the hosts it's set up for with stripe size of 512Kb instead of 256K (https://wikitech.wi...
[14:31:41] (PS3) Herron: role::logstash::collector: migrate to profile::logstash::collector [puppet] - https://gerrit.wikimedia.org/r/470452 (https://phabricator.wikimedia.org/T206454)
[14:32:43] (CR) Herron: [C: 2] role::logstash::collector: migrate to profile::logstash::collector [puppet] - https://gerrit.wikimedia.org/r/470452 (https://phabricator.wikimedia.org/T206454) (owner: Herron)
[14:34:12] jynus or marostegui: Please let me know if the maintenance run I just started seems to cause any problems on s3. If all goes well, I'll be doing similar runs for the rest of the wikis (in parallel by section) soon-ish. Maybe tomorrow or Monday depending on how fast the current run completes.
[14:34:35] Operations, Certcentral, DNS, Traffic, Patch-For-Review: Allow Let's Encrypt issue wildcard certificates - https://phabricator.wikimedia.org/T208390 (Vgutierrez)
[14:35:21] anomie: one sec
[14:37:51] (CR) Alex Monk: "I think that's complicated by the fact that standard::ntp gets included in the "standard" class itself, which is included from a *lot* of " [puppet] - https://gerrit.wikimedia.org/r/470446 (https://phabricator.wikimedia.org/T208244) (owner: Andrew Bogott)
[14:37:55] (PS2) Ema: admin: requested groups membership for sbassett [puppet] - https://gerrit.wikimedia.org/r/470783 (https://phabricator.wikimedia.org/T207852)
[14:38:48] (PS1) Vgutierrez: certcentral: Add pinkunicorn-wildcard certificate configuration [puppet] - https://gerrit.wikimedia.org/r/470846 (https://phabricator.wikimedia.org/T208424)
[14:39:09] (CR) Ema: [C: 2] admin: requested groups membership for sbassett [puppet] - https://gerrit.wikimedia.org/r/470783 (https://phabricator.wikimedia.org/T207852) (owner: Ema)
[14:41:44] (CR) Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/13280/" [puppet] - https://gerrit.wikimedia.org/r/470846 (https://phabricator.wikimedia.org/T208424) (owner: Vgutierrez)
[14:42:34] RECOVERY - High lag on wdqs1005 is OK: (C)3600 ge (W)1200 ge 170 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[14:43:10] !log movd wtp1034 eth0 to new switch...it was left over
[14:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:45] RECOVERY - High lag on wdqs1004 is OK: (C)3600 ge (W)1200 ge 306 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[14:45:06] Operations, ops-codfw, DBA, Patch-For-Review, User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (jcrespo) A larger stripe size should not be a huge issue (unlike a smaller one, which affected performance significantly and we didn't like it). We were thi...
[14:47:15] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[14:47:21] Operations, ops-codfw, DBA, Patch-For-Review, User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (Banyek) @jcrespo actually i can change the stripe size on one of the hosts, and do some comparison, what do you think about this?
[14:54:12] (CR) Effie Mouzeli: [C: 1] admin: groups membership for jdl [puppet] - https://gerrit.wikimedia.org/r/470785 (https://phabricator.wikimedia.org/T207951) (owner: Ema)
[14:58:17] (PS2) Ema: admin: add new user 'jdl' [puppet] - https://gerrit.wikimedia.org/r/470784 (https://phabricator.wikimedia.org/T207951)
[14:59:14] (CR) Ema: [C: 2] admin: add new user 'jdl' [puppet] - https://gerrit.wikimedia.org/r/470784 (https://phabricator.wikimedia.org/T207951) (owner: Ema)
[15:02:38] (PS2) Ema: admin: groups membership for jdl [puppet] - https://gerrit.wikimedia.org/r/470785 (https://phabricator.wikimedia.org/T207951)
[15:03:22] (CR) Ema: [C: 2] admin: groups membership for jdl [puppet] - https://gerrit.wikimedia.org/r/470785 (https://phabricator.wikimedia.org/T207951) (owner: Ema)
[15:07:15] (PS1) Cmjohnson: Adding mgmt dns for new servers p1007-10 [dns] - https://gerrit.wikimedia.org/r/470849 (https://phabricator.wikimedia.org/T207258)
[15:12:03] Operations, ops-eqiad, DBA, Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (Cmjohnson)
[15:14:51] Operations, ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (Cmjohnson) HP wanted me to reseat the sata cables which I did, and now all 10 disks are showing again but we're back to the original issue of the raid battery not fully charging. The amount of time and e...
[15:16:08] (PS1) Elukey: role::statistics::private: add deprecation motd to stat1005 [puppet] - https://gerrit.wikimedia.org/r/470850 (https://phabricator.wikimedia.org/T205846)
[15:16:36] (PS2) Elukey: role::statistics::private: add deprecation motd to stat1005 [puppet] - https://gerrit.wikimedia.org/r/470850 (https://phabricator.wikimedia.org/T205846)
[15:17:11] (CR) jerkins-bot: [V: -1] role::statistics::private: add deprecation motd to stat1005 [puppet] - https://gerrit.wikimedia.org/r/470850 (https://phabricator.wikimedia.org/T205846) (owner: Elukey)
[15:18:22] Operations, ops-eqiad, Analytics: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (Cmjohnson) @elukey the new disk arrived, I am happy to swap it whenever you're ready. it's the first disk on the server and you will need manually replace it in raid since it's SW raid. ping w...
[15:18:34] (CR) Elukey: [V: 2 C: 2] "https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/470850/" [puppet] - https://gerrit.wikimedia.org/r/470850 (https://phabricator.wikimedia.org/T205846) (owner: Elukey)
[15:19:17] (PS3) Cwhite: graphite: add queue_depth and batch_size options to carbon-c-relay [puppet] - https://gerrit.wikimedia.org/r/470659 (https://phabricator.wikimedia.org/T196484)
[15:19:18] My run of migrateComments.php on group0 finished now.
[15:19:28] (PS1) Banyek: mariadb: added pc2007 as parsercache host for shard pc1 [puppet] - https://gerrit.wikimedia.org/r/470851 (https://phabricator.wikimedia.org/T208383)
[15:19:49] anomie: as usual, no concerns except the lagging it may create
[15:20:03] jynus: Thanks for checking.
[15:20:06] (CR) jerkins-bot: [V: -1] mariadb: added pc2007 as parsercache host for shard pc1 [puppet] - https://gerrit.wikimedia.org/r/470851 (https://phabricator.wikimedia.org/T208383) (owner: Banyek)
[15:20:22] anomie: also you have a tendency to run those just before friday/major holidays :-D
[15:20:48] jynus: Is there a holiday coming up?
[15:21:29] no issue, just a funny coincidence
[15:22:17] just be around to calm down people if lag starts happening on codfw/dbstores/labsdbs
[15:22:28] ;-)
[15:23:13] (PS1) Dmaza: Enable Partial Blocks on testwiki and testiwkidata [mediawiki-config] - https://gerrit.wikimedia.org/r/470852 (https://phabricator.wikimedia.org/T203821)
[15:23:14] all the work you are doing with DMLs is work you are saving that I will not have to do with DDLs
[15:23:18] (PS4) Cwhite: add socket_bufsize option to make SO_RCVBUF tunable [debs/statsd-proxy] (wmf_v0.0.10) - https://gerrit.wikimedia.org/r/470512 (https://phabricator.wikimedia.org/T196484)
[15:23:46] anomie: one last thing, only partially related
[15:24:24] we are going to start implementing db consistency checks/alerts, I may need your input in the future for some ideas
[15:24:42] Ok, feel free to CC me.
[15:25:52] cmjohnson1: I can try to shutdown aqs1006 in 5/10 mins
[15:25:56] are you free?
[15:26:01] Yes
[15:26:08] Free enough ;-)
[15:26:14] anomie: thank you
[15:26:46] Elukey. Disk is hot swap. Shutdown is not necessary
[15:27:41] cmjohnson1: ah ok, I only need to fail the disk via mdadm
[15:27:46] Plus it’s the first disk. Most likely has grub installed on it. If i pull it out you may not get the OS to come back on reboot.
[15:29:06] (PS4) Cwhite: graphite: add queue_depth and batch_size options to carbon-c-relay [puppet] - https://gerrit.wikimedia.org/r/470659 (https://phabricator.wikimedia.org/T196484)
[15:29:59] (CR) Cwhite: [C: 2] graphite: add queue_depth and batch_size options to carbon-c-relay [puppet] - https://gerrit.wikimedia.org/r/470659 (https://phabricator.wikimedia.org/T196484) (owner: Cwhite)
[15:31:48] (PS2) Gehel: wdqs: raise alerting threshold on updater lag for public cluster [puppet] - https://gerrit.wikimedia.org/r/470819 (https://phabricator.wikimedia.org/T199228)
[15:31:53] Operations, SRE-Access-Requests, Patch-For-Review, User-jijiki: Requesting access to deployment, operational logs, and analytics cluster for jlinehan - https://phabricator.wikimedia.org/T207951 (ema) @jlinehan please try to SSH as jdl to one of the systems you should now have access to, for examp...
[15:32:23] Operations, SRE-Access-Requests: Requesting access to Jupyter notebook / analytics-privatedata-users for jgleeson - https://phabricator.wikimedia.org/T208432 (jgleeson)
[15:32:45] (CR) jerkins-bot: [V: -1] wdqs: raise alerting threshold on updater lag for public cluster [puppet] - https://gerrit.wikimedia.org/r/470819 (https://phabricator.wikimedia.org/T199228) (owner: Gehel)
[15:32:56] (PS3) Gehel: wdqs: raise alerting threshold on updater lag for public cluster [puppet] - https://gerrit.wikimedia.org/r/470819 (https://phabricator.wikimedia.org/T199228)
[15:32:57] Operations: Package and install php 7.2 in place of php 7.0 - https://phabricator.wikimedia.org/T208433 (Joe) p:Triage>High
[15:33:00] cmjohnson1: need to figure out what it is the partition to fail sorry, can we do it in say 2h? (meetings now)
[15:33:01] (CR) Gehel: wdqs: raise alerting threshold on updater lag for public cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/470819 (https://phabricator.wikimedia.org/T199228) (owner: Gehel)
[15:33:03] otherwise will do it now
[15:33:15] Operations, SRE-Access-Requests, Patch-For-Review, User-jijiki: Requesting access to deployment and analytics-privatedata-users for sbassett - https://phabricator.wikimedia.org/T207852 (ema) @sbassett please try to SSH as sbassett to one of the systems you should now have access to, for example s...
[15:33:21] Operations, User-Joe: Package and install php 7.2 in place of php 7.0 - https://phabricator.wikimedia.org/T208433 (Joe) a:Joe
[15:33:27] let's plan for tomorrow morning (elukey)
[15:33:37] ack
[15:33:51] (CR) Vgutierrez: [C: 2] wikimedia.org CAA: allow wildcards for LE [dns] - https://gerrit.wikimedia.org/r/470816 (https://phabricator.wikimedia.org/T208390) (owner: BBlack)
[15:39:06] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[15:39:50] (PS2) Cwhite: phabricator: remove custom diamond::collector [puppet] - https://gerrit.wikimedia.org/r/466988 (https://phabricator.wikimedia.org/T183454)
[15:40:01] cmjohnson1: sorry for the extra ping - just checked and it seems that mdadm already expelled the disk (/dev/sde) so I think that you are ready to swap if you have time now or later
[15:40:15] the host is already depooled
[15:40:31] I was confused by the output of /prod/mdstat
[15:40:36] *proc
[15:40:55] okay..yeah I can do it now
[15:40:58] elukey ^
[15:41:02] super
[15:41:48] done
[15:41:52] (CR) Cwhite: [C: 2] phabricator: remove custom diamond::collector [puppet] - https://gerrit.wikimedia.org/r/466988 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[15:41:54] thanks!
[15:42:23] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (Cmjohnson)
[15:42:32] Operations, ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (Cmjohnson) The disk has been swapped
[15:45:56] (PS2) Vgutierrez: certcentral: Add pinkunicorn-wildcard certificate configuration [puppet] - https://gerrit.wikimedia.org/r/470846 (https://phabricator.wikimedia.org/T208424)
[15:46:04] (CR) Vgutierrez: [C: 2] certcentral: Add pinkunicorn-wildcard certificate configuration [puppet] - https://gerrit.wikimedia.org/r/470846 (https://phabricator.wikimedia.org/T208424) (owner: Vgutierrez)
[15:46:26] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 47 probes of 325 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[15:46:30] (PS1) Andrew Bogott: Horizon: move projects to eqiad1: huggle, mwstake, logging, mobile [puppet] - https://gerrit.wikimedia.org/r/470855 (https://phabricator.wikimedia.org/T204745)
[15:47:30] (PS2) Andrew Bogott: Horizon: move projects to eqiad1: huggle, mwstake, logging, mobile [puppet] - https://gerrit.wikimedia.org/r/470855 (https://phabricator.wikimedia.org/T204745)
[15:48:53] (CR) Andrew Bogott: [C: 2] Horizon: move projects to eqiad1: huggle, mwstake, logging, mobile [puppet] - https://gerrit.wikimedia.org/r/470855 (https://phabricator.wikimedia.org/T204745) (owner: Andrew Bogott)
[15:50:12] Operations, Analytics, Analytics-Kanban, netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Ottomata) Just had a great meeting with @chasemp, @faidon, @JAllemandou and @nuria. The main action item (after Nuria h...
[15:53:27] Operations, Analytics, Analytics-Kanban, netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (chasemp) My notes from the 2018-10-31 meeting: ```https://phabricator.wikimedia.org/T207321#4691776 * hosts that push...
[15:57:34] Operations, SRE-Access-Requests, Patch-For-Review, User-jijiki: Requesting access to deployment and analytics-privatedata-users for sbassett - https://phabricator.wikimedia.org/T207852 (sbassett) @ema Looks like I'm in (`sbassett@stat1007:~$`) Thanks.
[16:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181031T1600).
[16:00:04] No GERRIT patches in the queue for this window AFAICS.
[16:04:03] Operations, SRE-Access-Requests, Patch-For-Review, User-jijiki: Requesting access to deployment and analytics-privatedata-users for sbassett - https://phabricator.wikimedia.org/T207852 (ema) Open>Resolved a:ema Very well!