[00:00:05] twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190502T0000).
[00:37:27] (CR) Faidon Liambotis: [C: -1] "I think this needs to be discussed further until we come to an agreement on how this would look like. Could you file a task about this or " [software/netbox-deploy] - https://gerrit.wikimedia.org/r/507217 (owner: Ayounsi)
[00:40:03] (PS1) Jforrester: composer: Ignore multiversion's vendor, too [mediawiki-config] - https://gerrit.wikimedia.org/r/507724
[00:40:05] (PS1) Jforrester: env: Allow for running outside the cluster for local testing [mediawiki-config] - https://gerrit.wikimedia.org/r/507725
[00:40:07] (PS1) Jforrester: [DNM] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507726
[00:40:09] (PS1) Jforrester: CommonSettings: Factor out load of variant config into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507727
[00:40:11] (PS1) Jforrester: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507728
[00:40:13] (PS1) Jforrester: [WIP] writeToStaticCache [mediawiki-config] - https://gerrit.wikimedia.org/r/507729
[00:41:05] (CR) jerkins-bot: [V: -1] [DNM] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507726 (owner: Jforrester)
[00:41:19] (CR) jerkins-bot: [V: -1] CommonSettings: Factor out load of variant config into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507727 (owner: Jforrester)
[00:41:42] (CR) jerkins-bot: [V: -1] CommonSettings: Factor out write of variant config into MWConfigCacheGenerator [mediawiki-config] - 
https://gerrit.wikimedia.org/r/507728 (owner: Jforrester)
[00:41:56] (CR) jerkins-bot: [V: -1] [WIP] writeToStaticCache [mediawiki-config] - https://gerrit.wikimedia.org/r/507729 (owner: Jforrester)
[00:41:58] (CR) Jforrester: [C: -2] "Here be dragons." [mediawiki-config] - https://gerrit.wikimedia.org/r/507726 (owner: Jforrester)
[00:43:29] (CR) Dzahn: "i would have expected to see the same file with the real key in private/hieradata/common/profile/librenms.yaml but that is not the case (y" [labs/private] - https://gerrit.wikimedia.org/r/507715 (https://phabricator.wikimedia.org/T207706) (owner: Ayounsi)
[00:44:38] (CR) Dzahn: "also, is it really the public key or is it the private key that needs to be in private?" [labs/private] - https://gerrit.wikimedia.org/r/507715 (https://phabricator.wikimedia.org/T207706) (owner: Ayounsi)
[00:45:58] (CR) Dzahn: [C: -1] LibreNMS, file files permission, add app key, add logrotate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: Ayounsi)
[00:46:25] (CR) Dzahn: [C: -1] LibreNMS, file files permission, add app key, add logrotate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: Ayounsi)
[00:48:23] (PS3) Paladox: Update plugins to 2.15.13 [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/507521
[00:55:12] (CR) Jforrester: "This is partially motivated by T220775 (having the diff on the static config files in the config repo forces explicit agreement that you m" [mediawiki-config] - https://gerrit.wikimedia.org/r/507729 (owner: Jforrester)
[01:14:18] (CR) Ayounsi: "> Patch Set 1:" [labs/private] - https://gerrit.wikimedia.org/r/507715 (https://phabricator.wikimedia.org/T207706) (owner: Ayounsi)
[01:16:18] (PS3) Ayounsi: LibreNMS, file files permission, add app key, add logrotate [puppet] - 
https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706)
[01:30:55] (CR) Ayounsi: LibreNMS, file files permission, add app key, add logrotate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: Ayounsi)
[02:44:18] Operations: cronspam from smart-data-dump due to facter bug - https://phabricator.wikimedia.org/T222326 (Dzahn)
[02:45:05] Operations: cronspam from smart-data-dump due to facter bug - https://phabricator.wikimedia.org/T222326 (Dzahn)
[02:46:18] Operations: cronspam from smart-data-dump due to facter bug - https://phabricator.wikimedia.org/T222326 (Dzahn)
[02:46:22] (CR) Dzahn: "https://phabricator.wikimedia.org/T222326" [puppet] - https://gerrit.wikimedia.org/r/507634 (https://phabricator.wikimedia.org/T86552) (owner: Dzahn)
[02:46:36] (CR) Dzahn: "https://phabricator.wikimedia.org/T222326" [puppet] - https://gerrit.wikimedia.org/r/507642 (owner: Dzahn)
[02:47:34] Operations, observability, Patch-For-Review, User-fgiunchedi: Monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552 (Dzahn) Made a new ticket at T222326 describing our current issue with cron spam from this caused by a facter bug.
[02:51:24] Operations, Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (Dzahn) If there is a consensus for this it's easy to do. It's just "rmlist engineering" on the server. By default the archives will be kept and stay public.. as they are.
[02:51:45] Operations, Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (Dzahn) p:Triage→Normal
[02:53:57] Operations, Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (Dzahn) Open→Stalled Setting to 'stalled' as we are waiting for the answer to questions.
[03:40:25] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[03:40:33] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[03:45:39] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:55:26] (PS3) Andrew Bogott: wmcs: remove hiera references to the now-deleted main deploy [puppet] - https://gerrit.wikimedia.org/r/507509
[03:56:15] (CR) Andrew Bogott: [C: +2] wmcs: remove hiera references to the now-deleted main deploy [puppet] - https://gerrit.wikimedia.org/r/507509 (owner: Andrew Bogott)
[04:05:19] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:06:47] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:12:11] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:15:20] !log kartik@deploy1001 
scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging]
[04:15:21] !log kartik@deploy1001 scap-helm cxserver cluster staging completed
[04:15:21] !log kartik@deploy1001 scap-helm cxserver finished
[04:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:41] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw]
[04:16:43] !log kartik@deploy1001 scap-helm cxserver cluster codfw completed
[04:16:43] !log kartik@deploy1001 scap-helm cxserver finished
[04:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:17:49] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[04:18:31] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad]
[04:18:32] !log kartik@deploy1001 scap-helm cxserver cluster eqiad completed
[04:18:32] !log kartik@deploy1001 scap-helm cxserver finished
[04:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:20:35] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[04:22:29] PROBLEM - Request latencies on acrab is CRITICAL: 
instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[04:23:31] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[04:23:46] !log apt-get upgrade on wikitech-static
[04:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:24:49] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[04:24:51] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[04:25:08] !log Updated cxserver to 2019-05-02-040910-production (T222305)
[04:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:13] T222305: CXServer alerting because it is requesting an old revision of a long page - https://phabricator.wikimedia.org/T222305
[04:25:27] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[04:26:09] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[04:26:45] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[04:27:45] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[04:28:37] !log upgraded mediawiki on wikitech-static to 1.32.1
[04:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:42:55] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[05:09:27] RECOVERY - puppet last run on mc1021 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:25:55] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[05:47:02] Operations, observability, Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (elukey) @CDanis please feel free, if you know exactly what needs to be changed, to modify all the necessary panels in the Kafka dashboard.. These g...
[05:47:47] (PS8) Elukey: admin: add the 'analytics' system user to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971)
[06:00:13] (CR) Elukey: [C: +2] "This is only adding a system user to the privatedata group that does not hold any sudo permission, and Nuria approved this a while ago in " [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[06:30:27] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/jobrunner.svc.eqiad.wmnet.crt]
[06:30:33] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle]
[06:30:41] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/default/ferm]
[06:31:07] PROBLEM - puppet last run on cloudvirt1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/RapidSSL_SHA256_CA_-_G3.crt]
[06:31:37] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/tmpreaper.conf]
[06:31:37] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/ferm.conf]
[06:33:37] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/conf.d/00_main]
[06:56:59] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:01] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:57:11] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:37] RECOVERY - puppet last run on cloudvirt1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:07] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:07] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:07] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:08:59] PROBLEM - Check systemd state on ms-be1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:10:17] RECOVERY - Check systemd state on ms-be1014 is OK: OK - running: The system is fully operational
[07:18:27] (CR) Gehel: [C: -1] "See comment inline" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/507561 (owner: Jbond)
[07:21:08] (PS1) Giuseppe Lavagetto: Allow proxyfetch to check more than one url at a time [debs/pybal] - https://gerrit.wikimedia.org/r/507740
[07:49:16] (CR) Gehel: "minor comments inline." (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto)
[07:53:58] (CR) Ema: [C: +1] "LGTM, a few suggestions." (5 comments) [debs/pybal] - https://gerrit.wikimedia.org/r/507740 (owner: Giuseppe Lavagetto)
[08:07:55] (CR) Gehel: "> Patch Set 1:" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/507390 (owner: CRusnov)
[08:08:29] (CR) Gehel: [C: +2] Add emacs ignores to gitignore [software/spicerack] - https://gerrit.wikimedia.org/r/507370 (owner: CRusnov)
[08:09:23] (CR) jenkins-bot: Add emacs ignores to gitignore [software/spicerack] - https://gerrit.wikimedia.org/r/507370 (owner: CRusnov)
[08:14:59] (PS1) Effie Mouzeli: Send 5% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/507741 (https://phabricator.wikimedia.org/T219150)
[08:20:06] Operations, observability, Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (elukey) I tried to set `instance_mode:cpu:rate5m{instance=~"$kafka_broker:.*",mode!="idle"}` for the cpu usage panel but I ended up in loading all...
[08:22:32] (CR) Volans: prometheus: add timeout paramter to query method (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/507561 (owner: Jbond)
[08:22:39] (CR) Gehel: [C: -1] cookbook API: add class API (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: Volans)
[08:22:55] (PS2) Jcrespo: mariadb: Reenable notifications for db2098-db2101 [puppet] - https://gerrit.wikimedia.org/r/507407 (https://phabricator.wikimedia.org/T220572)
[08:27:08] (CR) Gehel: [C: -1] wdqs: add WDQS restart cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: Mathew.onipe)
[08:28:54] (CR) Jcrespo: [C: +2] mariadb: Reenable notifications for db2098-db2101 [puppet] - https://gerrit.wikimedia.org/r/507407 (https://phabricator.wikimedia.org/T220572) (owner: Jcrespo)
[08:29:44] (PS1) Ema: cache: reimage cp4023 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/507744 (https://phabricator.wikimedia.org/T219967)
[08:32:23] (CR) Gehel: [C: +1] "LGTM, I'll wait a bit to see if volans has anything to add" [cookbooks] - https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: Mathew.onipe)
[08:37:20] (CR) Volans: "LGTM, just one question inline" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: Mathew.onipe)
[08:41:23] (CR) Filippo Giunchedi: "Great start! 
See first round of comments inline" (3 comments) [debs/prometheus-varnishkafka-exporter] - https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[08:43:47] (CR) Filippo Giunchedi: [C: +1] Send 5% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/507741 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[08:47:44] (CR) Effie Mouzeli: [C: +2] Send 5% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/507741 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[08:48:25] (PS1) Jcrespo: mariadb-backups: Set db2102 as a backup test host on codfw [puppet] - https://gerrit.wikimedia.org/r/507746 (https://phabricator.wikimedia.org/T220572)
[08:48:48] (Merged) jenkins-bot: Send 5% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/507741 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[08:49:04] !log Sending more traffic to PHP7.2 - T219150
[08:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:08] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150
[08:51:22] (CR) Volans: "Given that the bugfix .12 was released it's probably better to upgrade directly to that one." 
[software/netbox-deploy] - https://gerrit.wikimedia.org/r/507510 (owner: CRusnov)
[08:52:18] (CR) Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1002/16274/" [puppet] - https://gerrit.wikimedia.org/r/507746 (https://phabricator.wikimedia.org/T220572) (owner: Jcrespo)
[08:54:57] (CR) jenkins-bot: Send 5% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/507741 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[08:55:21] !log jiji@deploy1001 Synchronized wmf-config/CommonSettings.php: Send 5% of anonymous users to PHP7.2 - T219150 (duration: 01m 03s)
[08:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:25] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150
[08:56:17] (PS2) Gehel: profile::elasticsearch::cirrus: Don't duplicate udev stuff [puppet] - https://gerrit.wikimedia.org/r/503709 (owner: Alex Monk)
[08:57:36] Operations: cronspam from smart-data-dump due to facter bug - https://phabricator.wikimedia.org/T222326 (fgiunchedi) Indeed, facter upgrade task is {T219803}, cc @jbond FYI. I'm not sure what's the right answer is, probably ignoring stderr from `facter` until facter upgrade is complete and then add `-l error`.
[08:58:31] (CR) Gehel: [C: +2] profile::elasticsearch::cirrus: Don't duplicate udev stuff [puppet] - https://gerrit.wikimedia.org/r/503709 (owner: Alex Monk)
[09:00:07] (CR) Jcrespo: [C: +2] mariadb-backups: Set db2102 as a backup test host on codfw [puppet] - https://gerrit.wikimedia.org/r/507746 (https://phabricator.wikimedia.org/T220572) (owner: Jcrespo)
[09:00:17] (PS2) Jcrespo: mariadb-backups: Set db2102 as a backup test host on codfw [puppet] - https://gerrit.wikimedia.org/r/507746 (https://phabricator.wikimedia.org/T220572)
[09:02:39] !log rollout rsyslog upgrade 8.1901.0-1~bpo9+wmf1 to eqiad
[09:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:48] !log depool cp4023 and reimage as upload_ats T219967
[09:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:52] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967
[09:03:42] (PS2) Ema: cache: reimage cp4023 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/507744 (https://phabricator.wikimedia.org/T219967)
[09:04:43] (CR) Ema: [C: +2] cache: reimage cp4023 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/507744 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[09:06:23] PROBLEM - puppet last run on mw1332 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Package[rsyslog],Package[rsyslog-kafka],Package[rsyslog-gnutls]
[09:07:01] !log reboot db2102 T220572
[09:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:06] T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572
[09:07:46] Operations, Traffic, Goal, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp4023.ulsfo.wmnet'] ` The log can be...
[09:08:16] Operations, ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220907 (fgiunchedi) >>! In T220907#5130265, @fgiunchedi wrote: >>>! In T220907#5120604, @Cmjohnson wrote: >> @fgiunchedi do you want to power off unplug and power on...that will clear the issue > > Yes please drain th...
[09:08:40] rsyslog is me
[09:09:09] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[rsyslog-gnutls]
[09:10:18] will recover at next puppet run, unlucky interaction with debdeploy
[09:11:20] RECOVERY - puppet last run on mw1332 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:14:00] (PS1) Filippo Giunchedi: hieradata: labmon1002 to prometheus v2 [puppet] - https://gerrit.wikimedia.org/r/507751 (https://phabricator.wikimedia.org/T187987)
[09:14:10] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:15:12] (CR) Volans: "I'm ok with the general approach, there are some bits missing though, see the comments inline." 
(2 comments) [software/cumin] - https://gerrit.wikimedia.org/r/497312 (https://phabricator.wikimedia.org/T218440) (owner: TheAnarcat)
[09:17:42] (CR) Filippo Giunchedi: [C: +2] hieradata: labmon1002 to prometheus v2 [puppet] - https://gerrit.wikimedia.org/r/507751 (https://phabricator.wikimedia.org/T187987) (owner: Filippo Giunchedi)
[09:23:13] (PS2) Filippo Giunchedi: hieradata: labmon1002 to prometheus v2 [puppet] - https://gerrit.wikimedia.org/r/507751 (https://phabricator.wikimedia.org/T187987)
[09:24:01] (CR) Elukey: "Thanks a lot for this! Did a very quick first pass, but it looks very promising." (2 comments) [debs/prometheus-varnishkafka-exporter] - https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[09:34:22] (PS1) Elukey: profile::hadoop::master: add alert for HDFS NN RCP queue length [puppet] - https://gerrit.wikimedia.org/r/507754 (https://phabricator.wikimedia.org/T220702)
[09:34:38] Operations, Discovery-Search: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (Gehel)
[09:35:21] (CR) jerkins-bot: [V: -1] profile::hadoop::master: add alert for HDFS NN RCP queue length [puppet] - https://gerrit.wikimedia.org/r/507754 (https://phabricator.wikimedia.org/T220702) (owner: Elukey)
[09:38:22] (PS1) Filippo Giunchedi: profile: add labmon1002 together with labmon1001 [puppet] - https://gerrit.wikimedia.org/r/507756
[09:39:12] (PS2) Elukey: profile::hadoop::master: add alert for HDFS NN RCP queue length [puppet] - https://gerrit.wikimedia.org/r/507754 (https://phabricator.wikimedia.org/T220702)
[09:41:57] !log testing backups on db2102 (increased network and disk usage) T220572
[09:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:02] T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572
[09:43:04] 
(PS4) Ema: Proxy Thumbor errors as-is [puppet] - https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071) (owner: Gilles)
[09:44:21] (CR) Ema: [C: +2] Proxy Thumbor errors as-is [puppet] - https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071) (owner: Gilles)
[09:50:11] Operations, Traffic, Goal, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4023.ulsfo.wmnet'] ` and were **ALL** successful.
[09:54:31] (CR) Elukey: [C: +2] profile::hadoop::master: add alert for HDFS NN RCP queue length [puppet] - https://gerrit.wikimedia.org/r/507754 (https://phabricator.wikimedia.org/T220702) (owner: Elukey)
[09:54:38] (PS3) Elukey: profile::hadoop::master: add alert for HDFS NN RCP queue length [puppet] - https://gerrit.wikimedia.org/r/507754 (https://phabricator.wikimedia.org/T220702)
[09:58:58] (PS2) Filippo Giunchedi: profile: add labmon1002 together with labmon1001 [puppet] - https://gerrit.wikimedia.org/r/507756
[10:03:58] !log pool cp4023 w/ ATS backend T219967
[10:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:03] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967
[10:04:42] (PS1) Jbond: smart-data-dump: add '-l error' to facter command to suppress warnings [puppet] - https://gerrit.wikimedia.org/r/507763 (https://phabricator.wikimedia.org/T86552)
[10:08:42] (PS2) ArielGlenn: generate index.html file for incr dumps once per pass over all wikis [dumps] - https://gerrit.wikimedia.org/r/505623 (https://phabricator.wikimedia.org/T221515)
[10:11:12] (CR) Volans: [C: +1] "LGTM as a temporary solution" [puppet] - https://gerrit.wikimedia.org/r/507763 (https://phabricator.wikimedia.org/T86552) (owner: Jbond)
[10:13:17] (CR) Ema: [C: +1] "+1 facter bug happy times" [puppet] - https://gerrit.wikimedia.org/r/507763 (https://phabricator.wikimedia.org/T86552) (owner: Jbond)
[10:14:43] (PS2) Jbond: smart-data-dump: add '-l error' to facter command to suppress warnings [puppet] - https://gerrit.wikimedia.org/r/507763 (https://phabricator.wikimedia.org/T86552)
[10:15:22] (CR) Jbond: [C: +2] smart-data-dump: add '-l error' to facter command to suppress warnings [puppet] - https://gerrit.wikimedia.org/r/507763 (https://phabricator.wikimedia.org/T86552) (owner: Jbond)
[10:15:44] (PS1) Jcrespo: mariadb: Prepare db1139 and db1140 for reimage [puppet] - https://gerrit.wikimedia.org/r/507764 (https://phabricator.wikimedia.org/T218985)
[10:19:38] Operations: cronspam from smart-data-dump due to facter bug - https://phabricator.wikimedia.org/T222326 (jbond)
[10:19:47] Operations, Puppet, Packaging, Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (jbond)
[10:21:13] (PS2) Jcrespo: mariadb: Prepare db1139 and db1140 for reimage [puppet] - https://gerrit.wikimedia.org/r/507764 (https://phabricator.wikimedia.org/T218985)
[10:24:30] Operations, ops-codfw, netbox: scs-a1-codfw: update serial in netbox - https://phabricator.wikimedia.org/T221984 (Volans)
[10:25:14] Operations, Puppet, Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (jbond) p:Triage→Normal
[10:25:30] Operations: cronspam from smart-data-dump due to facter bug - https://phabricator.wikimedia.org/T222326 (jbond) i have added a plaster to the `smart-data-dump` to stop the spam and will investigate the underlining issues further via T222326
[10:27:50] Operations, netbox: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634 (Volans)
[10:28:15] Operations, netbox: netbox won't allow me to upload photos 
of the rack - https://phabricator.wikimedia.org/T209182 (Volans)
[10:28:27] Operations, Operations-Software-Development, netbox: Cumin: add backend for Netbox - https://phabricator.wikimedia.org/T205900 (Volans)
[10:29:15] (PS3) ArielGlenn: generate index.html file for incr dumps once per pass over all wikis [dumps] - https://gerrit.wikimedia.org/r/505623 (https://phabricator.wikimedia.org/T221515)
[10:29:24] Operations, netbox: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (Volans)
[10:29:42] Operations, ops-ulsfo, netbox: ulsfo netbox updates - https://phabricator.wikimedia.org/T221785 (Volans)
[10:33:02] Operations, netbox: netbox: User's groups not updated - https://phabricator.wikimedia.org/T220004 (Volans) It would be interesting to know if a logout/login was attempted to have Netbox refresh them.
[10:33:20] Operations, Operations-Software-Development, netbox, netops, User-crusnov: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (Volans)
[10:33:30] Operations, Operations-Software-Development, netbox: Netbox: cable termination names report - https://phabricator.wikimedia.org/T216469 (Volans)
[10:33:56] (CR) ArielGlenn: [C: +2] generate index.html file for incr dumps once per pass over all wikis [dumps] - https://gerrit.wikimedia.org/r/505623 (https://phabricator.wikimedia.org/T221515) (owner: ArielGlenn)
[10:33:58] Operations, netbox, netops: Netbox switches consistency report - https://phabricator.wikimedia.org/T212878 (Volans)
[10:34:27] Operations, netbox, netops: Netbox should use CN rather than UID for LDAP login username - https://phabricator.wikimedia.org/T210566 (Volans)
[10:36:47] !log ariel@deploy1001 Started deploy [dumps/dumps@53c9f22]: speed up adds-changes dumps by generating index.html less often. 
tmep sleep 120 [10:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:03] !log ariel@deploy1001 Finished deploy [dumps/dumps@53c9f22]: speed up adds-changes dumps by generating index.html less often. tmep sleep 120 (duration: 00m 15s) [10:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:27] (03CR) 10Vgutierrez: [C: 03+1] "+1 overall, please address Ema's comments :)" (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/507740 (owner: 10Giuseppe Lavagetto) [10:39:12] 10Operations, 10Discovery-Search: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (10ArielGlenn) Where are these dumps being downloaded from, and by what means? (I may need to add someone else to this task to weigh in, depending on the answer.) [10:42:28] (03CR) 10Vgutierrez: [C: 03+2] Release 0.17 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/507026 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez) [10:44:58] (03CR) 10jenkins-bot: Release 0.17 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/507026 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez) [10:50:36] (03PS2) 10Arturo Borrero Gonzalez: openstack: drop labtest/labtestn unused code [puppet] - 10https://gerrit.wikimedia.org/r/506962 (https://phabricator.wikimedia.org/T218026) [10:53:44] PROBLEM - puppet last run on mw1270 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:00:04] MaxSem, RoanKattouw, and Niharika: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190502T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
[11:04:46] (03CR) 10Volans: Netbox module for Spicerack (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [11:09:19] (03PS1) 10Arturo Borrero Gonzalez: labtestpuppetmaster: use a codfw1dev role [puppet] - 10https://gerrit.wikimedia.org/r/507767 (https://phabricator.wikimedia.org/T218026) [11:13:44] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: reallocate puppetmaster hiera keys from labtest [labs/private] - 10https://gerrit.wikimedia.org/r/507768 (https://phabricator.wikimedia.org/T218026) [11:14:10] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hieradata: openstack: reallocate puppetmaster hiera keys from labtest [labs/private] - 10https://gerrit.wikimedia.org/r/507768 (https://phabricator.wikimedia.org/T218026) (owner: 10Arturo Borrero Gonzalez) [11:19:24] (03PS2) 10Arturo Borrero Gonzalez: labtestpuppetmaster: use a codfw1dev role [puppet] - 10https://gerrit.wikimedia.org/r/507767 (https://phabricator.wikimedia.org/T218026) [11:19:56] RECOVERY - puppet last run on mw1270 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:20:23] (03CR) 10Jcrespo: [C: 03+2] mariadb: Prepare db1139 and db1140 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/507764 (https://phabricator.wikimedia.org/T218985) (owner: 10Jcrespo) [11:23:03] (03PS3) 10Arturo Borrero Gonzalez: labtestpuppetmaster: use a codfw1dev role [puppet] - 10https://gerrit.wikimedia.org/r/507767 (https://phabricator.wikimedia.org/T218026) [11:23:48] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: add labmon1002 together with labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/507756 (owner: 10Filippo Giunchedi) [11:23:57] (03PS3) 10Filippo Giunchedi: profile: add labmon1002 together with labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/507756 [11:24:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestpuppetmaster: use a codfw1dev role [puppet] - 
10https://gerrit.wikimedia.org/r/507767 (https://phabricator.wikimedia.org/T218026) (owner: 10Arturo Borrero Gonzalez) [11:25:14] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10jcrespo) Either dns, remote ipmi or password may not be configured properly: ` Error: Unable to establish IPMI v2 / RMCP+ session 11:23:36 | Unab... [11:36:57] (03PS3) 10Arturo Borrero Gonzalez: openstack: drop labtest/labtestn unused code [puppet] - 10https://gerrit.wikimedia.org/r/506962 (https://phabricator.wikimedia.org/T218026) [11:41:42] (03PS4) 10Arturo Borrero Gonzalez: openstack: drop labtest/labtestn unused code [puppet] - 10https://gerrit.wikimedia.org/r/506962 (https://phabricator.wikimedia.org/T218026) [11:46:40] (03PS5) 10Arturo Borrero Gonzalez: openstack: drop labtest/labtestn unused code [puppet] - 10https://gerrit.wikimedia.org/r/506962 (https://phabricator.wikimedia.org/T218026) [11:46:57] (03PS5) 10Santhosh: Redirect Google Translate any wiki source to mobile [puppet] - 10https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) [11:47:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: drop labtest/labtestn unused code [puppet] - 10https://gerrit.wikimedia.org/r/506962 (https://phabricator.wikimedia.org/T218026) (owner: 10Arturo Borrero Gonzalez) [11:47:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "Confirmed via PCC" [puppet] - 10https://gerrit.wikimedia.org/r/506962 (https://phabricator.wikimedia.org/T218026) (owner: 10Arturo Borrero Gonzalez) [11:49:10] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:49:34] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:53:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:54:52] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190502T1200) [12:01:03] !log restart swift-proxy on ms-fe1005 T222071 [12:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:09] T222071: SwiftMedia URL rewrite returns some 404s with wrong Content-Length - https://phabricator.wikimedia.org/T222071 [12:02:12] PROBLEM - puppet last run on db1117 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:03:14] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:06:08] !log swift-proxy rolling restart T222071 [12:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:13] T222071: SwiftMedia URL rewrite returns some 404s with wrong Content-Length - https://phabricator.wikimedia.org/T222071 [12:12:43] (03PS1) 10Arturo Borrero Gonzalez: mariadb: ferm_wmcs: update references to labtest [puppet] - 10https://gerrit.wikimedia.org/r/507770 (https://phabricator.wikimedia.org/T218026) [12:14:00] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:14:15] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1139.eqiad.wmnet', 'db1140.eqiad.wmne... [12:15:00] (03PS2) 10Arturo Borrero Gonzalez: mariadb: ferm_wmcs: update references to labtest [puppet] - 10https://gerrit.wikimedia.org/r/507770 (https://phabricator.wikimedia.org/T218026) [12:16:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC OK https://puppet-compiler.wmflabs.org/compiler1002/16285/" [puppet] - 10https://gerrit.wikimedia.org/r/507770 (https://phabricator.wikimedia.org/T218026) (owner: 10Arturo Borrero Gonzalez) [12:17:32] PROBLEM - puppet last run on db2078 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:18:04] RECOVERY - puppet last run on db1117 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [12:18:38] ^^ that was me [12:19:08] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:19:18] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:22:50] RECOVERY - puppet last run on db2078 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:22:57] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10jcrespo) @Cmjohnson In case this is useful for you, I have documented how to enable ipmi on ilo5 from the web interface here: https://wikitech.wi... 
[12:23:22] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10jcrespo) [12:25:40] (03PS1) 10Arturo Borrero Gonzalez: openstack: base: fullstack: fix hiera lookup for wrong key [puppet] - 10https://gerrit.wikimedia.org/r/507774 (https://phabricator.wikimedia.org/T218026) [12:27:45] 10Operations, 10Performance-Team, 10Thumbor, 10Traffic, 10Patch-For-Review: SwiftMedia URL rewrite returns some 404s with wrong Content-Length - https://phabricator.wikimedia.org/T222071 (10ema) 05Open→03Resolved This is now fixed, CL matches the actual body length: ` $ curl -v http://swift-rw.disco... [12:28:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC OK https://puppet-compiler.wmflabs.org/compiler1002/16288/" [puppet] - 10https://gerrit.wikimedia.org/r/507774 (https://phabricator.wikimedia.org/T218026) (owner: 10Arturo Borrero Gonzalez) [12:29:17] PROBLEM - HHVM rendering on mw1345 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:30:11] (03PS4) 10Filippo Giunchedi: profile: add labmon1002 together with labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/507756 [12:30:28] RECOVERY - HHVM rendering on mw1345 is OK: HTTP OK: HTTP/1.1 200 OK - 73841 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:31:04] 10Operations, 10Discovery-Search: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (10Gehel) The use cas being run currently is actually the cirrus dumps to initialize cloudelastic servers. They are downloaded on mwmaint1002 with `curl -s https://dumps.wikimedia.org/oth... 
[12:33:53] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1140.eqiad.wmnet', 'db1139.eqiad.wmnet'] ` and were **ALL** successful. [12:35:40] 10Operations, 10media-storage: swift falsely claims 404s are gzipped - https://phabricator.wikimedia.org/T219635 (10fgiunchedi) 05Open→03Invalid Unreproducible/fixed ATM: ` $ curl --raw -v -H "Accept-Encoding: gzip" -H "Host: upload.wikimedia.org" https://swift-ro.discovery.wmnet/wikipedia/commons/thumb/b... [12:36:01] 10Operations, 10Discovery-Search: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (10ArielGlenn) Ok, so adding @Bstorm to weigh in about these limits or to bounce it to someone else on the WMCS team. There's an open ticket for that too: T191491 [12:36:10] (03PS1) 10Arturo Borrero Gonzalez: openstack: labtest: drop unused observerenv profile [puppet] - 10https://gerrit.wikimedia.org/r/507775 (https://phabricator.wikimedia.org/T218026) [12:36:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: labtest: drop unused observerenv profile [puppet] - 10https://gerrit.wikimedia.org/r/507775 (https://phabricator.wikimedia.org/T218026) (owner: 10Arturo Borrero Gonzalez) [12:37:53] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10jcrespo) [12:38:55] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10jcrespo) 05Open→03Resolved a:05jcrespo→03Cmjohnson installed, implementation (provisioning) will be handled at T220572. 
[12:42:09] !log stopping several instances at dbstore1001 to clone them to db1139/40 T220572 [12:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:13] T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572 [12:46:30] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: rename remaining labtest hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/507780 (https://phabricator.wikimedia.org/T218026) [12:50:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1002/16290/" [puppet] - 10https://gerrit.wikimedia.org/r/507780 (https://phabricator.wikimedia.org/T218026) (owner: 10Arturo Borrero Gonzalez) [12:57:31] (03PS3) 10BBlack: Convert most DYNA into 1H CNAME records [dns] - 10https://gerrit.wikimedia.org/r/507399 (https://phabricator.wikimedia.org/T208263) [12:57:33] (03PS3) 10BBlack: Change CNAME->DYNA TTLs from 1H to 1D [dns] - 10https://gerrit.wikimedia.org/r/507400 (https://phabricator.wikimedia.org/T208263) [12:57:50] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: cleanup labtest/labtestn hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/507783 (https://phabricator.wikimedia.org/T218026) [12:59:09] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hieradata: openstack: cleanup labtest/labtestn hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/507783 (https://phabricator.wikimedia.org/T218026) (owner: 10Arturo Borrero Gonzalez) [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190502T1300) [13:04:12] (03PS1) 10Paladox: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - 10https://gerrit.wikimedia.org/r/507787 [13:06:54] (03PS2) 10Paladox: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - 10https://gerrit.wikimedia.org/r/507787 [13:08:40] PROBLEM - puppet last run on 
lvs3001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [13:10:33] volans: the updated check works! :) [13:10:46] in this case there's no dependency cycle though [13:10:51] May 2 13:02:04 lvs3001 puppet-agent[10790]: Could not retrieve catalog from remote server: request https://puppet:8140/puppet/v3/catalog/lvs3001.esams.wmnet interrupted after 0.168 seconds [13:11:20] ema: interesting, also I was told that due to a change in behaviour in puppet, that would be the default error on puppet 6 [13:11:34] so we might want to adjust the message maybe [13:11:37] yeah [13:11:44] re-running puppet on lvs3001 meanwhile [13:11:46] but I would kept it different from the other generic one [13:11:55] to allow to distinguish them [13:11:57] (03PS3) 10Paladox: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) [13:13:58] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:15:54] 10Operations, 10ops-codfw, 10media-storage: ms-be2043 /dev/sdd drive failure - https://phabricator.wikimedia.org/T222362 (10CDanis) [13:17:24] (03PS4) 10Paladox: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) [13:17:29] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [13:18:12] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10BBlack) The current iteration of the proposed broadly-applied production version is in PS3 of the patch @ https... 
[13:20:03] 10Operations, 10ops-codfw, 10media-storage: ms-be2043 /dev/sdd drive failure - https://phabricator.wikimedia.org/T222362 (10Volans) From `sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli --all` the only out of the ordinary data that I see is: ` Media Error Count: 2 ` Although: ` Other Error Count... [13:21:57] (03CR) 10Paladox: "I've tested this change locally on my mac and it works (correctly redirects and allows cloning again over /p/ on 2.16)" [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [13:22:17] 10Operations, 10ops-codfw, 10media-storage: ms-be2043 /dev/sdd drive failure - https://phabricator.wikimedia.org/T222362 (10CDanis) I think the 'real' thing we need to notify on here is when Swift decides it wants to stop using a disk (which it did here) `May 1 19:01:01 ms-be2043 drive-audit: Errors found... [13:33:52] 10Operations, 10media-storage: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications - https://phabricator.wikimedia.org/T222362 (10CDanis) [13:41:19] (03Abandoned) 10Hashar: ImageFSM._is_published error handling [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/486056 (https://phabricator.wikimedia.org/T214441) (owner: 10Hashar) [13:42:04] (03Abandoned) 10Hashar: hhvm: fix typo in RUN_AS_GROUP [puppet] - 10https://gerrit.wikimedia.org/r/474910 (https://phabricator.wikimedia.org/T209946) (owner: 10Hashar) [13:42:06] (03Abandoned) 10Hashar: hhvm: test default file generation [puppet] - 10https://gerrit.wikimedia.org/r/474917 (https://phabricator.wikimedia.org/T209946) (owner: 10Hashar) [13:42:49] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): hhvm systemd service on deployment-prep reports: hhvm.service: Ignoring invalid environment assignment 'RUN_AS_GROUP=www-data - https://phabricator.wikimedia.org/T209946 (10hashar) 05Open→03De... 
[14:04:48] (03CR) 10Ottomata: "BTW, I did a bunch of rdkafka metric exporting/matching in" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [14:11:30] (03PS1) 10Vgutierrez: config: Move ACMEChiefConfig to its own module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507801 (https://phabricator.wikimedia.org/T220518) [14:11:32] (03PS1) 10Vgutierrez: dns: Move DNS operations to its own module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507802 [14:11:34] (03PS1) 10Vgutierrez: CI: Run tests with minimum and latest dependencies [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507803 (https://phabricator.wikimedia.org/T213820) [14:11:36] (03PS1) 10Vgutierrez: acme_chief: Prevalidate CN/SNI list [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507804 (https://phabricator.wikimedia.org/T220518) [14:11:38] (03PS1) 10Vgutierrez: Release 0.17 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507805 (https://phabricator.wikimedia.org/T220518) [14:12:26] (03PS2) 10Clarakosi: Add support for OpenAPI 3.0 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507489 (https://phabricator.wikimedia.org/T218218) [14:12:52] (03CR) 10Clarakosi: Add support for OpenAPI 3.0 (031 comment) [software/service-checker] - 10https://gerrit.wikimedia.org/r/507489 (https://phabricator.wikimedia.org/T218218) (owner: 10Clarakosi) [14:19:44] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) Hm, sorry for this probably too late idea...but would it be worth building a C based prometheus... 
[14:24:58] 10Operations, 10observability, 10Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) labmon1002 has been migrated and seems to be working, I'll upgrade labmon1001 early next week. [14:27:40] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10jbond) This is due to a [[ https://phabricator.wikimedia.org/T222356 | bug in facter ]] fundamentally cased because there are an even number or words in the output of `ip route show`. I... [14:31:01] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10elukey) @Ottomata ahhhh you mean in varnishkafka itself! I thought that it would have needed a change in... [14:37:55] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) I think varnishkafka is already using this callback to write the stats out to the json file. I... [14:39:11] (03Abandoned) 10CRusnov: Update requirements and artifacts for Netbox v2.5.11 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507510 (owner: 10CRusnov) [14:39:49] (03PS1) 10CRusnov: Upgrade Netbox to 2.5.12 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507809 (https://phabricator.wikimedia.org/T222351) [14:40:42] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10elukey) But something would need to be created (a simple exporter) to read the json with the Prometheus m... 
[14:44:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10Cmjohnson) @Andrew the disk has been replaced, all yours to install [14:44:18] 10Operations, 10DBA, 10MediaWiki-Database, 10MediaWiki-Logging, and 5 others: Special:Log on commons -- entire web request took longer than 60 seconds and timed out - https://phabricator.wikimedia.org/T221458 (10WDoranWMF) 05Open→03Resolved a:03WDoranWMF [14:46:02] (03CR) 10CRusnov: "Tested on af-netbox" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507809 (https://phabricator.wikimedia.org/T222351) (owner: 10CRusnov) [14:46:36] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) Aye yeah I guess there'd have to be some pull service, ya. Maybe converting whatever varnishka... [14:50:12] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson) [14:51:38] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10ema) >>! In T196066#5153029, @Ottomata wrote: > Hm, sorry for this probably too late idea...but would it... 
[14:57:20] (03PS1) 10Elukey: admin: allow analytics-admins to sudo as the analytics user [puppet] - 10https://gerrit.wikimedia.org/r/507812 (https://phabricator.wikimedia.org/T222368) [14:57:23] (03PS1) 10Elukey: admin: allow analytics-admins to use systemctl for all units [puppet] - 10https://gerrit.wikimedia.org/r/507813 (https://phabricator.wikimedia.org/T222368) [15:00:57] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10fgiunchedi) >>! In T222166#5149390, @hashar wrote: > The workaround kind of make sense, however whenever we provision a new instance we wo... [15:02:04] (03CR) 10Ppchelko: [C: 03+1] Add support for OpenAPI 3.0 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507489 (https://phabricator.wikimedia.org/T218218) (owner: 10Clarakosi) [15:02:51] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) Ya good point [15:02:54] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10ayounsi) The `lock` is needed to workaround a bug with the kernel/ipsec. See https://gerrit.wikimedia.org/r/c/operations/puppet/+/437784 and https://phabricator.wikimedia.org/T195365 [15:03:57] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) Hm but also, whatever we replace varnishkafka with will likely be librdkafka based. Perhaps a l... 
[15:09:15] (03CR) 10Volans: "I don't see the change in src/ to point to the newer commit" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507809 (https://phabricator.wikimedia.org/T222351) (owner: 10CRusnov) [15:13:50] (03PS38) 10Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) [15:15:09] !log add dsharpe to content admin on wikitech for user blocking [15:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:23] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson) [15:23:53] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Provide support for multiple ATS instances [puppet] - 10https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez) [15:24:13] jouncebot: now [15:24:13] No deployments scheduled for the next 0 hour(s) and 35 minute(s) [15:24:15] jouncebot: next [15:24:15] In 0 hour(s) and 35 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190502T1600) [15:24:24] (03PS3) 10Reedy: Revert "Temporarily disable account creation on wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507594 [15:24:46] (03CR) 10Reedy: [C: 03+2] Revert "Temporarily disable account creation on wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507594 (owner: 10Reedy) [15:25:53] (03Merged) 10jenkins-bot: Revert "Temporarily disable account creation on wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507594 (owner: 10Reedy) [15:26:09] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10jbond) >>! In T222356#5153203, @ayounsi wrote: > The `lock` is needed to workaround a bug with the kernel/ipsec. 
> See https://gerrit.wikimedia.org/r/c/operations/puppet/+/437784 and htt... [15:26:14] (03CR) 10jenkins-bot: Revert "Temporarily disable account creation on wikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507594 (owner: 10Reedy) [15:31:42] jouncebot: refresh [15:31:43] I refreshed my knowledge about deployments. [15:36:31] (03PS23) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [15:36:50] (03CR) 10Jbond: "Ready for another review" (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [15:37:03] (03PS16) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [15:37:30] (03PS3) 10Vgutierrez: prometheus: Support several instances of the trafficserver exporter [puppet] - 10https://gerrit.wikimedia.org/r/506659 (https://phabricator.wikimedia.org/T221217) [15:37:43] (03PS2) 10Vgutierrez: nagios_common: Provide check_https_hostheader_port_url check [puppet] - 10https://gerrit.wikimedia.org/r/507006 (https://phabricator.wikimedia.org/T221594) [15:37:46] (03PS3) 10Reedy: Invalidate user sessions and log them out upon blocking on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507644 (https://phabricator.wikimedia.org/T222282) [15:37:57] (03PS5) 10Vgutierrez: trafficserver: Provide a unified monitoring define [puppet] - 10https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217) [15:38:00] (03PS4) 10Reedy: Invalidate user sessions and log them out upon blocking on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507644 (https://phabricator.wikimedia.org/T222282) [15:38:03] (03CR) 10Reedy: [C: 03+2] Invalidate user sessions and log them out upon blocking on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507644 
(https://phabricator.wikimedia.org/T222282) (owner: 10Reedy) [15:38:20] (03CR) 10Paladox: "This is a safe redirect (/p/ is considered optional, and in fact upstream doin't recommend cloning over this url now)" [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [15:38:25] (03PS28) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [15:39:03] (03PS2) 10CRusnov: Upgrade Netbox to 2.5.12 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507809 (https://phabricator.wikimedia.org/T222351) [15:39:23] (03Merged) 10jenkins-bot: Invalidate user sessions and log them out upon blocking on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507644 (https://phabricator.wikimedia.org/T222282) (owner: 10Reedy) [15:40:17] (03PS4) 10Herron: rsyslog: add netdev_kafka_relay compatability endpoint [puppet] - 10https://gerrit.wikimedia.org/r/495980 (https://phabricator.wikimedia.org/T213899) [15:40:56] !log reedy@deploy1001 Synchronized wmf-config/wikitech.php: Invalidate user sessions upon blocking on wikitech (duration: 00m 59s) [15:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:57] (03CR) 10jenkins-bot: Invalidate user sessions and log them out upon blocking on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507644 (https://phabricator.wikimedia.org/T222282) (owner: 10Reedy) [15:42:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10aborrero) Andrew is traveling this week, so I will handle the reimage. 
[15:42:35] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Re-enable account creation on wikitech (duration: 00m 57s) [15:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1007.eqiad.wmnet ` The log can be... [15:46:35] (03PS4) 10BBlack: Convert most DYNA into 1H CNAME records [dns] - 10https://gerrit.wikimedia.org/r/507399 (https://phabricator.wikimedia.org/T208263) [15:46:36] (03PS4) 10BBlack: Change CNAME->DYNA TTLs from 1H to 1D [dns] - 10https://gerrit.wikimedia.org/r/507400 (https://phabricator.wikimedia.org/T208263) [15:47:43] (03PS5) 10CRusnov: Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) [15:48:39] (03CR) 10jerkins-bot: [V: 04-1] Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [15:52:06] (03PS6) 10CRusnov: Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) [15:59:05] (03PS5) 10TheAnarcat: allow running cumin as a regular user [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 [16:00:05] godog and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190502T1600). [16:00:05] James_F and paladox: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:36] (03CR) 10TheAnarcat: [C: 03+1] "fixed CI by disabling the "root check", also updated the comment - all done?" [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (owner: 10TheAnarcat) [16:02:08] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson) [16:02:27] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507809 (https://phabricator.wikimedia.org/T222351) (owner: 10CRusnov) [16:02:39] * paladox here [16:03:55] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Upgrade Netbox to 2.5.12 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507809 (https://phabricator.wikimedia.org/T222351) (owner: 10CRusnov) [16:05:37] (03PS7) 10CRusnov: Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) [16:06:49] (03PS2) 10Elukey: admin: allow analytics-admins to use systemctl for all units [puppet] - 10https://gerrit.wikimedia.org/r/507813 (https://phabricator.wikimedia.org/T222368) [16:10:00] !log restarted dbproxy1005 haproxy, weird connection issue [16:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:06] 10Operations, 10Analytics, 10Discovery, 10Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10Nuria) Ping @CDanis, any updates on this? [16:11:17] (03CR) 10Jbond: [C: 03+1] "LGTM we can always exclude them from specific services if we see problems later with e.g. `! 
/usr/bin/systemctl * ssh`" [puppet] - 10https://gerrit.wikimedia.org/r/507813 (https://phabricator.wikimedia.org/T222368) (owner: 10Elukey) [16:13:02] (03PS8) 10CRusnov: Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) [16:14:07] (03CR) 10Volans: "recheck" [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (owner: 10TheAnarcat) [16:16:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1007.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudvirt1007.eqiad.wmnet'] ` [16:17:34] (03CR) 10Volans: "Almost there, just a small nitpick." (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (owner: 10TheAnarcat) [16:19:46] (03PS9) 10CRusnov: Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) [16:20:13] anyone going to do puppet swat? :) [16:21:17] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, and 3 others: Move proton logging to new logging pipeline - https://phabricator.wikimedia.org/T219925 (10Tgr) [16:22:02] paladox: yea [16:22:13] thanks :) [16:22:39] paladox: looking now.. 
one sec [16:22:54] (03CR) 10jerkins-bot: [V: 04-1] allow running cumin as a regular user [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (owner: 10TheAnarcat) [16:24:03] ok :) [16:25:20] 10Operations, 10Core Platform Team Backlog, 10Reading-Infrastructure-Team-Backlog, 10Services, and 2 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10MSantos) [16:25:48] (03PS3) 10Dzahn: mwgrep: Include JSON files in search [puppet] - 10https://gerrit.wikimedia.org/r/503349 (owner: 10Esanders) [16:26:11] 10Operations, 10Core Platform Team Backlog, 10Maps, 10Reading-Infrastructure-Team-Backlog, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10MSantos) [16:26:41] (03CR) 10Dzahn: [C: 03+2] mwgrep: Include JSON files in search [puppet] - 10https://gerrit.wikimedia.org/r/503349 (owner: 10Esanders) [16:27:36] 10Operations, 10Mobile-Content-Service, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, and 3 others: Move mobile apps logging to new logging pipeline - https://phabricator.wikimedia.org/T219924 (10bearND) [16:27:47] (03PS2) 10Dzahn: mwgrep: Also find Gadgets-definition message [puppet] - 10https://gerrit.wikimedia.org/r/504991 (owner: 10Jforrester) [16:27:55] (03PS1) 10Elukey: role::analytics_cluster::coordinator: add system users [puppet] - 10https://gerrit.wikimedia.org/r/507827 (https://phabricator.wikimedia.org/T220971) [16:28:51] (03PS6) 10TheAnarcat: allow running cumin as a regular user [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 [16:29:01] 10Operations, 10Core Platform Team Backlog, 10Maps, 10Reading-Infrastructure-Team-Backlog, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10MSantos) [16:29:05] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Kanban (Done with CPT), and 2 others: Move service-runner to 
new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10MSantos) [16:29:09] (03CR) 10Dzahn: [C: 03+2] mwgrep: Also find Gadgets-definition message [puppet] - 10https://gerrit.wikimedia.org/r/504991 (owner: 10Jforrester) [16:29:11] (03CR) 10TheAnarcat: [C: 03+1] "recheck" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (owner: 10TheAnarcat) [16:29:22] 10Operations, 10Core Platform Team Backlog, 10Maps, 10Reading-Infrastructure-Team-Backlog, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10MSantos) p:05Triage→03Normal [16:31:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1007.eqiad.wmnet ` The log can be... [16:31:43] James_F: mwgrep changes merged (swat) [16:32:56] (03PS3) 10Herron: logstash: add tcp json_lines localhost compatability endpoint [puppet] - 10https://gerrit.wikimedia.org/r/496021 (https://phabricator.wikimedia.org/T213899) [16:33:07] (03PS10) 10CRusnov: Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) [16:35:12] 10Operations, 10Continuous-Integration-Infrastructure: Upload Zuul 2.5.1-wmf7 package to apt.wikimedia.org - https://phabricator.wikimedia.org/T220380 (10hashar) I have already upgraded Zuul on the production machines as well as the WMCS instances. So uploading to apt.wikimedia.org would be a noop :] [16:40:53] 10Operations, 10ops-codfw: pull decom hardware and ship to Harry/OIT @ SF office - https://phabricator.wikimedia.org/T222383 (10RobH) p:05Triage→03Normal [16:41:22] (03CR) 10CRusnov: "PCC seems happy which makes me happy." 
[puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [16:42:30] !log replaying 30 minutes of eqiad search traffic on codfw - T221121 [16:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:35] T221121: Capacity planning for elastic search - https://phabricator.wikimedia.org/T221121 [16:47:48] (03PS5) 10Paladox: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - 10https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) [16:48:53] 10Operations, 10serviceops, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Jdforrester-WMF) [16:51:47] 10Operations, 10ops-codfw, 10decommission: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10Papaul) @robh this server is still showing up on the switch side ` papaul@asw-b-codfw> show interfaces ge-8/0/12 descriptions Interface Admin Link Description ge-... [16:52:06] 10Operations, 10ops-codfw, 10decommission: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10Papaul) a:05Papaul→03RobH [16:54:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1007.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudvirt1007.eqiad.wmnet'] ` [16:54:57] 10Operations, 10ops-codfw, 10decommission: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10RobH) a:05RobH→03Papaul Done, port disabled, back to you. 
[16:56:10] (03CR) 10Gehel: prometheus: add timeout paramter to query method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [16:56:52] (03CR) 10Volans: "recheck" [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (owner: 10TheAnarcat) [16:58:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1007.eqiad.wmnet ` The log can be... [17:00:04] cscott, arlolra, subbu, and halfak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190502T1700). [17:02:13] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10Pablo-WMDE) @mobrovac Thanks for the feedback. If it is possible at all we would really appreciate if you could link us to the... 
[17:10:54] (03PS2) 10Elukey: role::analytics_cluster::coordinator: add system users [puppet] - 10https://gerrit.wikimedia.org/r/507827 (https://phabricator.wikimedia.org/T220971) [17:11:16] (03PS3) 10Elukey: role::analytics_cluster::coordinator: add system users [puppet] - 10https://gerrit.wikimedia.org/r/507827 (https://phabricator.wikimedia.org/T220971) [17:14:31] (03PS2) 10Kosta Harlan: GrowthExperiments: Enable SpecialHomepage feature for cs/kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507114 (https://phabricator.wikimedia.org/T221266) [17:16:54] RECOVERY - Memory correctable errors -EDAC- on kafka1023 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=kafka1023&var-datasource=eqiad+prometheus/ops [17:18:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10aborrero) I had some issues with reimaging because the drive replacement. LVM cound't find the old disk UUID (obviously) and I had to force things by... [17:18:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1007.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudvirt1007.eqiad.wmnet'] ` [17:23:24] (03CR) 10Volans: [C: 03+2] "Thanks for your contribution!" [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (owner: 10TheAnarcat) [17:25:20] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator: add system users [puppet] - 10https://gerrit.wikimedia.org/r/507827 (https://phabricator.wikimedia.org/T220971) (owner: 10Elukey) [17:32:06] (03Merged) 10jenkins-bot: allow running cumin as a regular user [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (owner: 10TheAnarcat) [17:32:49] secteam plans to deploy a patch for T222324 now. 
Let us know if there are any objections. [17:33:17] (03CR) 10jenkins-bot: allow running cumin as a regular user [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (owner: 10TheAnarcat) [17:36:30] (03CR) 10Dzahn: Add a check_netbox_report icinga check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [17:37:49] 10Operations, 10Analytics, 10Discovery, 10Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10CDanis) I got tied up with goal work and incident response and have only had a little time to spend on this. The client that @Ottomata found does look like a good on... [17:38:00] (03PS1) 10Herron: WIP puppetmaster-standalone - add dynamic envs that map to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/507846 [17:38:53] (03CR) 10jerkins-bot: [V: 04-1] WIP puppetmaster-standalone - add dynamic envs that map to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/507846 (owner: 10Herron) [17:39:45] !log arlolra@deploy1001 Started deploy [parsoid/deploy@414387b]: Updating Parsoid to 9786781 [17:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:41] (03PS3) 10Andrew Bogott: Allow puppet-merge to merge the labs/private repo [puppet] - 10https://gerrit.wikimedia.org/r/506582 (https://phabricator.wikimedia.org/T221888) [17:42:48] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson) [17:43:22] (03PS2) 10Herron: WIP puppetmaster-standalone - add dynamic envs that map to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/507846 [17:43:47] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson) a:05Cmjohnson→03RobH @robh all the servers are racked and on-site work has been completed. 
Some are off and some are in a state that just needs... [17:43:49] (03CR) 10Dzahn: [C: 03+1] "nice! thanks for the follow-up on https://gerrit.wikimedia.org/r/c/operations/puppet/+/507634 this was really the only way to fix it for n" [puppet] - 10https://gerrit.wikimedia.org/r/507763 (https://phabricator.wikimedia.org/T86552) (owner: 10Jbond) [17:44:08] (03CR) 10jerkins-bot: [V: 04-1] WIP puppetmaster-standalone - add dynamic envs that map to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/507846 (owner: 10Herron) [17:44:36] (03CR) 10Dzahn: [V: 03+1] "compiles fine: https://puppet-compiler.wmflabs.org/compiler1002/16299/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [17:45:30] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@414387b]: Updating Parsoid to 9786781 (duration: 05m 45s) [17:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:35] (03CR) 10Andrew Bogott: Allow puppet-merge to merge the labs/private repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506582 (https://phabricator.wikimedia.org/T221888) (owner: 10Andrew Bogott) [17:46:52] !log Deployed patch for T222324 (1.34.0-wmf.3) [17:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:54] (03PS3) 10Herron: WIP puppetmaster-standalone - add dynamic envs that map to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/507846 [17:50:30] (03CR) 10Jbond: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/507763 (https://phabricator.wikimedia.org/T86552) (owner: 10Jbond) [17:52:04] (03CR) 10jerkins-bot: [V: 04-1] WIP puppetmaster-standalone - add dynamic envs that map to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/507846 (owner: 10Herron) [17:55:00] (03PS4) 10Herron: WIP puppetmaster-standalone - add dynamic envs that map to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/507846 [17:55:44] 
10Operations, 10ops-codfw, 10decommission: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (10Papaul) @Robh thanks [17:57:18] mutante: Oh, thanks! [17:58:23] :) [17:59:58] 10Operations, 10Core Platform Team Backlog, 10Maps, 10Reading-Infrastructure-Team-Backlog, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10Pchelolo) [18:00:04] MaxSem, RoanKattouw, and Niharika: (Dis)respected human, time to deploy Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190502T1800). Please do the needful. [18:00:05] ottomata and kostajh: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:47] here [18:00:50] I can SWAT [18:01:55] Thanks RoanKattouw; I've got a security patch to regularise once you're done, if you could ping me? 
[18:02:04] Yes will ping you when donie [18:02:06] *done [18:02:12] (03CR) 10Volans: "I've a leftover comment while waiting for the next CR :)" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [18:02:18] (03CR) 10Catrope: [C: 03+2] GrowthExperiments: Enable SpecialHomepage feature for cs/kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507114 (https://phabricator.wikimedia.org/T221266) (owner: 10Kosta Harlan) [18:02:35] (03PS1) 10Paladox: Gerrit: increase sendemail threads to 2 [puppet] - 10https://gerrit.wikimedia.org/r/507852 [18:03:17] (03PS2) 10Paladox: Gerrit: increase sendemail threads to 2 [puppet] - 10https://gerrit.wikimedia.org/r/507852 [18:03:19] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "lgtm for the puppet part, just first run puppet on netbox*, then on icinga* ..because of exported resources and for good measure check 'ic" [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [18:03:22] (03Merged) 10jenkins-bot: GrowthExperiments: Enable SpecialHomepage feature for cs/kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507114 (https://phabricator.wikimedia.org/T221266) (owner: 10Kosta Harlan) [18:03:43] Also presumably the WMDE lot will want https://gerrit.wikimedia.org/r/507847 deployed as a train un-blocker. 
[18:03:47] (03CR) 10jenkins-bot: GrowthExperiments: Enable SpecialHomepage feature for cs/kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507114 (https://phabricator.wikimedia.org/T221266) (owner: 10Kosta Harlan) [18:03:49] i'm here [18:03:57] (03PS1) 10Paladox: Gerrit: Decrease sendemail timeout to 30s [puppet] - 10https://gerrit.wikimedia.org/r/507853 [18:04:09] RoanKattouw: my change should be a noop until wmf.3 goes out later [18:04:12] you can deploy that at will [18:04:33] (03PS2) 10Paladox: Gerrit: Decrease sendemail timeout to 30s [puppet] - 10https://gerrit.wikimedia.org/r/507853 [18:05:48] kostajh: Config patch is on mwdebug1002, please test [18:05:55] RoanKattouw: on it [18:09:15] !log phab1001 - install package upgrades for bash and cron [18:09:17] RoanKattouw: I can enable the prefs on cs/kowiki, tutorial module changes state when visiting the title, and mentors assigned to me in both wikis. I think we're OK to proceed. [18:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:45] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10Pchelolo) [18:10:25] (03PS2) 10Catrope: Enable cirrussearch-request logging to eventgate-analytics on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507709 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [18:10:44] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable SpecialHomepage on cswiki and kowiki (T221266) (duration: 00m 58s) [18:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:48] T221266: Homepage: Deploy to target wikis in production - https://phabricator.wikimedia.org/T221266 [18:11:15] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - 
https://phabricator.wikimedia.org/T222112 (10CDanis) I think you should just be able to remove the "custom all value" in the dashboard settings and have it work. In this case Grafana will cre... [18:11:44] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10CDanis) Also sorry, I don't have a lot of time left over this week; can take a deeper look next week [18:14:02] (03CR) 10Catrope: [C: 03+2] Enable cirrussearch-request logging to eventgate-analytics on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507709 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [18:15:01] (03Merged) 10jenkins-bot: Enable cirrussearch-request logging to eventgate-analytics on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507709 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [18:16:16] (03CR) 10jenkins-bot: Enable cirrussearch-request logging to eventgate-analytics on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507709 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [18:18:31] (03PS11) 10CRusnov: Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) [18:18:33] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable cirrussearch-request logging to eventgate-analytics on all wikis (T214080) (duration: 00m 58s) [18:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:37] T214080: Rewrite Avro schemas (ApiAction, CirrusSearchRequestSet) as JSONSchema and produce to EventGate - https://phabricator.wikimedia.org/T214080 [18:19:26] (03CR) 10jerkins-bot: [V: 04-1] Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [18:21:25] thanks! 
[18:22:12] (03PS12) 10CRusnov: Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) [18:30:10] (03PS1) 10Thcipriani: gerrit: bump heap limit [puppet] - 10https://gerrit.wikimedia.org/r/507858 (https://phabricator.wikimedia.org/T221026) [18:33:55] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.3/extensions/GrowthExperiments/: Don't fatal on deleted pages in 'recent questions' (T222206) (duration: 01m 01s) [18:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:59] T222206: Homepage: If a talk page containing a question is deleted, it causes fatal error for all newbies - https://phabricator.wikimedia.org/T222206 [18:37:11] (03PS2) 10Cwhite: initial attempt at a varnishkafka exporter [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) [18:37:19] (03CR) 10Cwhite: initial attempt at a varnishkafka exporter (035 comments) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [18:38:02] (03CR) 10CDanis: [C: 03+2] gerrit: bump heap limit [puppet] - 10https://gerrit.wikimedia.org/r/507858 (https://phabricator.wikimedia.org/T221026) (owner: 10Thcipriani) [18:38:16] (03PS4) 10CDanis: gerrit: reduce sshd.MaxConnectionsPerUser 32 -> 4 [puppet] - 10https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) (owner: 10Hashar) [18:38:55] (03CR) 10CDanis: [C: 03+2] gerrit: reduce sshd.MaxConnectionsPerUser 32 -> 4 [puppet] - 10https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) (owner: 10Hashar) [18:44:58] (03PS1) 10Ottomata: Reset offsets to earliest for mediawiki-avro camus job [puppet] - 10https://gerrit.wikimedia.org/r/507862 [18:45:49] (03CR) 10Ottomata: [C: 03+2] Reset offsets to earliest for mediawiki-avro camus job [puppet] - 
10https://gerrit.wikimedia.org/r/507862 (owner: 10Ottomata) [18:56:07] 10Operations, 10Discovery-Search: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (10EBernhardson) The particular limit we are running into is part of the nginx config in `modules/dumps/templates/web/xmldumps/nginx.conf.erb` which specifies `limit_rate 2048k` [18:57:48] 10Operations, 10Gerrit, 10Release-Engineering-Team: Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (10thcipriani) [18:58:57] I am working with dbstore1001 [18:59:11] it may complain about dpkg temporarily [18:59:24] actually [18:59:59] !log restart dbstore1001 for upgrade [19:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] thcipriani: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Americas version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190502T1900). [19:00:48] !log phab1001 - upgrading PHP packages on prod phab server [19:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:28] train time [19:02:33] yeehaw [19:03:00] thcipriani: will be watching this one, it will turn the new search request logging on for all wikis [19:03:15] should be ok since it is happening fine on group1, and there's already more volume of these logs for api-request [19:03:20] but will be watching! [19:03:21] !log phab2001 - apt-get autoremove ..removes a single python package not needed anymore [19:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:30] (03CR) 10Cwhite: "> Patch Set 1:" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [19:08:07] James_F: looks like you +2d my last blocker. I started to type "just +2'd" but that is inaccurate it seems. 
[19:08:21] 10Operations, 10Gerrit, 10Release-Engineering-Team: Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (10Paladox) Wanted to note that gerrit2001 has 64gb of ram, so this increase would match it so that we have the same ram specs in both data centres. [19:12:44] (03CR) 10Ottomata: "> * In most of the metrics, I see "eventgate" at the beginning of the metric name and a "service" label which appear to be the same value." [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [19:13:20] James_F: except it has a failing test it seems, ugh [19:13:29] (03CR) 10Ottomata: "> "Event uses two Kafka producers..."" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [19:17:23] thcipriani: Yeah, gremlins everywhere. [19:17:46] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:17:49] so it would seem [19:18:00] thcipriani: The stuff I'm landing shouldn't block the train (the UBN in Wikibase is already live because that's on wmf.3). [19:18:18] thcipriani: So just sling the train out now? [19:18:48] fair enough [19:18:50] sure [19:19:28] an-coord alert is me manually running a job [19:19:30] * thcipriani gets ducks in a row [19:19:42] i guess the proc exits non 0 if it can't launch? 
[19:19:42] hm [19:20:46] (03CR) 10Cwhite: "> Patch Set 2:" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [19:21:49] (03PS1) 10Thcipriani: all wikis to 1.34.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507864 [19:21:51] (03CR) 10Thcipriani: [C: 03+2] all wikis to 1.34.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507864 (owner: 10Thcipriani) [19:23:00] (03CR) 10Ottomata: "Hm, yeah tricky because it also would be nice to keep the names close to what librdkafka calls them. That makes automatically matching th" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [19:23:03] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507864 (owner: 10Thcipriani) [19:23:17] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507864 (owner: 10Thcipriani) [19:27:04] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.3 [19:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:24] train slung [19:29:08] CC ottomata for search request logging volume. :-) [19:30:15] i see it! 
[19:31:23] !log Shuffled 1.34.0-wmf.3 security patch cee0e569f4 for T222324 into the tip of the upstream branch now it's merged; no-op [19:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:52] (03PS2) 10Jbond: prometheus: add timeout parameter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 [19:32:37] (03CR) 10Jbond: prometheus: add timeout parameter to query method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [19:33:10] Not to jinx anything, but right now the only things I see in the top 50 hits in fatalmonitor beyond the fix for T222347 I'm landing now are timeouts. [19:33:10] T222347: wbsearchentities now returns an error with type=lexeme - https://phabricator.wikimedia.org/T222347 [19:33:21] Which makes for a difference from normal. [19:34:39] cautiously optimistic [19:38:00] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:39:44] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.3/includes/widget/SearchInputWidget.php: Hot-deploy T222329 fix part 1 (duration: 00m 53s) [19:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:48] T222329: Special:Search generates "TypeError: this.pushPending is not a function" - https://phabricator.wikimedia.org/T222329 [19:40:27] exceptions and fatals graph seems to indicate the 60 second timeouts [19:40:54] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.3/resources/src/mediawiki.widgets/mw.widgets.SearchInputWidget.js: Hot-deploy T222329 fix part 2 (duration: 00m 50s) [19:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:58] I'm backing away from the WBLCS patch cherry-pick as it's failing. 
[19:47:08] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [19:47:10] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:50:14] James_F: I'm going to move that blocker forward so it doesn't drop off the radar. [19:51:20] (03PS1) 10Jcrespo: mariadb: Setup db1139 and db1140 as the new eqiad backup sources [puppet] - 10https://gerrit.wikimedia.org/r/507867 (https://phabricator.wikimedia.org/T220572) [19:51:33] thcipriani: +1 [19:53:03] (03CR) 10Jcrespo: [C: 03+2] mariadb: Setup db1139 and db1140 as the new eqiad backup sources [puppet] - 10https://gerrit.wikimedia.org/r/507867 (https://phabricator.wikimedia.org/T220572) (owner: 10Jcrespo) [19:53:12] 10Operations, 10Gerrit, 10Release-Engineering-Team: Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (10Dzahn) So.. cobalt is already on a list of [[ T217764 | machines will be over 5 years old during FY19-20 ]] -> T217764#5005267 which was compiled to determine the number of needed (misc)... [19:54:25] 10Operations, 10ops-eqiad, 10Gerrit, 10Release-Engineering-Team, 10serviceops: Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (10Dzahn) [20:11:04] James_F: is T222229 a thing I should be rolling back for? or just a thing that needs to be backported asap? [20:11:04] T222229: [Regression wmf.3] Cannot save edit after switching to the source editor from mobile VE if no other edits are made on source editor mode - https://phabricator.wikimedia.org/T222229 [20:12:55] thcipriani: The latter. 
:-( [20:13:12] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:13:33] thcipriani: (And I can't fix it because I don't have the peculiar version of node that the Web team demands.) [20:13:38] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:14:46] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:16:36] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [20:17:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:17:26] we served a bunch of 503s in the minute of 20:10, not sure why [20:17:33] it does look to have recovered though [20:17:42] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:18:28] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:18:49] mediawiki-errors and fatal-monitor don't really correlate with the event [20:19:12] hm [20:19:49] James_F: where do web folks like to roam these days? (so many channels). Any chance jdlrobson knows the magic for https://gerrit.wikimedia.org/r/507869/ ? (/me picks MobileFrontEnd contributor at random :)) [20:20:13] thcipriani: Somewhere not on IRC, I believe. [20:20:45] I'm busy trying to write a homebrew package for node 6.11.0 but it's hampered by node 6 being EOL and deleted from everywhere. [20:20:50] so many mediums for so many channels [20:21:38] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:21:52] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [20:22:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:22:42] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:50:49] (03PS5) 10Herron: WIP puppetmaster-standalone - add dynamic envs that map to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/507846 [20:57:40] !log crusnov@deploy1001 Started deploy 
[netbox/deploy@bf9aef2]: Upgrade Netbox to 2.5.12 - T222351 [20:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:44] T222351: Netbox: upgrade to v2.5.12 - https://phabricator.wikimedia.org/T222351 [20:58:02] James_F: looks like niedzielski regenerated resources for https://gerrit.wikimedia.org/r/507869 good to go now? [20:58:14] !log crusnov@deploy1001 Finished deploy [netbox/deploy@bf9aef2]: Upgrade Netbox to 2.5.12 - T222351 (duration: 00m 33s) [20:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:19] !log crusnov@deploy1001 Started deploy [netbox/deploy@bf9aef2]: Upgrade Netbox to 2.5.12 - T222351 [20:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:07] !log crusnov@deploy1001 Finished deploy [netbox/deploy@bf9aef2]: Upgrade Netbox to 2.5.12 - T222351 (duration: 01m 48s) [21:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:20] PROBLEM - Host mw1294 is DOWN: PING CRITICAL - Packet loss = 100% [21:02:20] yeah, that's rebuilt. i think it's ok to go from my end but you should verify the specific fix [21:02:59] wrt node, we just use node version manager (https://github.com/nvm-sh/nvm) until our CI jobs support a modern version. hopefully not an issue for long [21:03:48] RECOVERY - Host mw1294 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [21:03:58] niedzielski: gotcha, thank you for the rebuild! Appreciated. [21:04:37] 👍 [21:08:39] we also have an RFC that solves this problem. if you are interested, please comment on T199004 [21:08:48] T199004: RFC: Add a frontend build step to skins/extensions to our deploy process - https://phabricator.wikimedia.org/T199004 [21:16:06] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [21:16:36] PROBLEM - puppet last run on lvs5002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. 
It might be a dependency cycle. [21:16:50] * thcipriani restarts gerrit between deploy windows [21:19:43] !log gerrit restart to pick up config changes: https://gerrit.wikimedia.org/r/504973/ and https://gerrit.wikimedia.org/r/507858/ [21:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:06] !log gerrit back [21:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:23] (03PS1) 10QChris: Add .gitreview [software/varnish/libvmod-uuid] - 10https://gerrit.wikimedia.org/r/507888 [21:30:25] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [software/varnish/libvmod-uuid] - 10https://gerrit.wikimedia.org/r/507888 (owner: 10QChris) [21:32:22] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:33:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:34:06] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [21:35:18] thcipriani: Yeh, merging now. [21:36:20] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:39:22] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [21:41:28] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:43:08] RECOVERY - puppet last run on lvs5002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:49:16] 10Operations, 10ops-codfw, 10Reading-Infrastructure-Team-Backlog, 10decommission: Decommission maps-test cluster - https://phabricator.wikimedia.org/T202898 (10Papaul) [22:09:09] 10Operations, 10ops-eqiad, 10netops: Replace eqiad mgmt switches with EX4200s - https://phabricator.wikimedia.org/T213128 (10ayounsi) 05Open→03Declined Going with option 2. Will open tasks in the next FY when it's time to order them. [22:10:54] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.3/extensions/MobileFrontend/resources/dist/mobile.editor.overlay.js: Hot-deploy T222229 to fix VE switching on MobileFrontend (duration: 00m 52s) [22:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:58] T222229: [Regression wmf.3] Cannot save edit after switching to the source editor from mobile VE if no other edits are made on source editor mode - https://phabricator.wikimedia.org/T222229 [22:13:15] <3 James_F [22:14:19] thcipriani: T220728 should now be closable. [22:14:20] T220728: 1.34.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T220728 [22:14:36] And ha, thanks to niedzielski for the node futzing and ryasmeen for highlighting the need. 
[22:14:55] done! [22:15:07] Whee. [22:25:52] 10Operations, 10netbox: Netbox should use CN rather than UID for LDAP login username - https://phabricator.wikimedia.org/T210566 (10ayounsi) [22:41:32] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: codfw row C switch upgrade - https://phabricator.wikimedia.org/T170380 (10ayounsi) [22:41:35] 10Operations, 10Traffic, 10netops: Investigate lvs IP pages during codfw row C switch upgrade - https://phabricator.wikimedia.org/T171032 (10ayounsi) 05Open→03Declined This is almost 2 years old now, I don't think we have any other logs to investigate it or if it happened again. Please reopen if you thin... [22:50:07] yayyyyy [22:59:13] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [23:00:04] MaxSem, RoanKattouw, and Niharika: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190502T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:02:57] 10Operations, 10Cloud-Services, 10netops, 10cloud-services-team (Kanban): Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496 (10Krenair) For the record, with the migration away from and shutdown of the nova-network 'main' region, the 208.80.155.128/25 range is no... [23:14:18] 10Operations, 10ops-eqiad, 10Gerrit, 10serviceops, 10Release-Engineering-Team (Watching / External): Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (10greg) [23:37:18] (03PS2) 10Dzahn: Add librenms laravel_app_key fake private key [labs/private] - 10https://gerrit.wikimedia.org/r/507715 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [23:39:22] (03CR) 10Dzahn: [V: 03+2 C: 03+2] "> Fake private key (in public repo) that I meant to say." 
[labs/private] - 10https://gerrit.wikimedia.org/r/507715 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [23:49:25] (03CR) 10Dzahn: "turns out this change would have been good and now we are at https://gerrit.wikimedia.org/r/c/operations/puppet/+/507852" [puppet] - 10https://gerrit.wikimedia.org/r/342313 (owner: 10Paladox) [23:49:48] (03CR) 10Dzahn: "also see: https://groups.google.com/forum/#!msg/repo-discuss/P5ZuIlh4sQs/_VH38ph3BAAJ" [puppet] - 10https://gerrit.wikimedia.org/r/507852 (owner: 10Paladox) [23:50:10] (03PS3) 10Dzahn: Gerrit: increase sendemail threads to 2 [puppet] - 10https://gerrit.wikimedia.org/r/507852 (owner: 10Paladox) [23:53:22] (03PS1) 10Dzahn: admins: extend access for pbj until May 13th [puppet] - 10https://gerrit.wikimedia.org/r/507901 [23:54:13] (03CR) 10Dzahn: [C: 03+2] admins: extend access for pbj until May 13th [puppet] - 10https://gerrit.wikimedia.org/r/507901 (owner: 10Dzahn) [23:54:59] (03CR) 10Dzahn: [C: 03+2] Gerrit: increase sendemail threads to 2 [puppet] - 10https://gerrit.wikimedia.org/r/507852 (owner: 10Paladox) [23:55:01] (03PS4) 10Dzahn: Gerrit: increase sendemail threads to 2 [puppet] - 10https://gerrit.wikimedia.org/r/507852 (owner: 10Paladox) [23:55:06] thanks mutante! [23:56:04] (03PS3) 10Paladox: Gerrit: Decrease sendemail timeout to 30s [puppet] - 10https://gerrit.wikimedia.org/r/507853 [23:58:55] (03CR) 10Dzahn: [C: 03+2] Gerrit: Decrease sendemail timeout to 30s [puppet] - 10https://gerrit.wikimedia.org/r/507853 (owner: 10Paladox)