[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181109T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:00:09] (PS1) Bstorm: sonofgridengine: catch the exception that plagues this in python3.5 [puppet] - https://gerrit.wikimedia.org/r/472600 (https://phabricator.wikimedia.org/T200557)
[00:01:36] (CR) Bstorm: [C: 2] sonofgridengine: catch the exception that plagues this in python3.5 [puppet] - https://gerrit.wikimedia.org/r/472600 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[00:02:14] (CR) Cwhite: initial commit (6 comments) [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/471298 (https://phabricator.wikimedia.org/T208066) (owner: Cwhite)
[00:02:47] (PS2) Cwhite: initial commit [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/471298 (https://phabricator.wikimedia.org/T208066)
[00:11:34] Operations: Migrate tests from nose to pytest - https://phabricator.wikimedia.org/T208783 (Bstorm) I'm now poking around also at what it would look like if all the python my team uses ended up in separate packages (debs etc), and I don't hate it...🤔
[00:17:41] (PS1) Jforrester: [Governance wiki] Create new 'editor' user group [mediawiki-config] - https://gerrit.wikimedia.org/r/472602 (https://phabricator.wikimedia.org/T205352)
[00:17:43] (PS1) Jforrester: [Governance wiki] Allow sysops to grant and remove 'editor' user group [mediawiki-config] - https://gerrit.wikimedia.org/r/472603
[00:17:45] (PS1) Jforrester: [Governance wiki] Move edit rights from users to 'editor' users [mediawiki-config] - https://gerrit.wikimedia.org/r/472604 (https://phabricator.wikimedia.org/T205350)
[00:17:47] (PS1) Jforrester:
[DNM] Drop the 'inactive' user group everywhere, it's unused [mediawiki-config] - https://gerrit.wikimedia.org/r/472605
[00:38:05] (PS1) Bstorm: sonofgridengine: added an important "not" [puppet] - https://gerrit.wikimedia.org/r/472606 (https://phabricator.wikimedia.org/T200557)
[00:44:45] (CR) Bstorm: [C: 2] sonofgridengine: added an important "not" [puppet] - https://gerrit.wikimedia.org/r/472606 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[00:51:22] (CR) Alex Monk: "If it's unused why is this DNM? :P" [mediawiki-config] - https://gerrit.wikimedia.org/r/472605 (owner: Jforrester)
[01:08:33] (PS1) Bstorm: sonofgridengine: correct a bunch of issues in the grid_configurator [puppet] - https://gerrit.wikimedia.org/r/472607 (https://phabricator.wikimedia.org/T200557)
[01:11:19] (CR) Bstorm: [C: 2] sonofgridengine: correct a bunch of issues in the grid_configurator [puppet] - https://gerrit.wikimedia.org/r/472607 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[01:19:43] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:21:03] (CR) Volans: "I didn't review the code, just left one comment for a thing I saw." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/472597 (https://phabricator.wikimedia.org/T208611) (owner: Rush)
[01:30:02] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[02:15:09] Operations, Gadgets, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.4; 2018-11-13), and 4 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (aaron) >>! In T203786#4730954, @elukey wrote: >>>! In T203786#472954...
[02:40:40] (CR) Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: Mathew.onipe)
[02:46:50] (PS17) Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918)
[03:11:15] (CR) Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: Mathew.onipe)
[03:12:27] (CR) Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: Mathew.onipe)
[03:12:29] I'm going to break the puppet compiler for a bit - hopefully no one is working now anyway :)
[03:13:05] (PS18) Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918)
[03:15:20] (PS10) Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919)
[03:35:31] ok, puppet compiler is back up at half capacity. Should be back to normal in 30-40 minutes.
[04:45:08] Operations, SRE-Access-Requests, User-CDanis, User-herron: pwstore access for cdanis - https://phabricator.wikimedia.org/T209134 (CDanis) p:Triage>Normal
[05:15:14] Operations, LDAP-Access-Requests, SRE-Access-Requests, Patch-For-Review, and 2 others: Onboarding Chris Danis (CDanis) - https://phabricator.wikimedia.org/T208729 (CDanis)
[05:48:03] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:30:24] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]
[06:35:12] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 55, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:46:22] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:55:54] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[07:50:35] PROBLEM - Host alcyone is DOWN: PING CRITICAL - Packet loss = 100%
[07:50:52] RECOVERY - Host alcyone is UP: PING OK - Packet loss = 0%, RTA = 36.33 ms
[07:52:26] <_joe_> something bad is happening with network
[07:52:41] <_joe_> I see all cp* in codfw unable to talk with eqiad, or esams
[07:53:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:53:03] PROBLEM - HTTP
availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:53:32] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job={varnish-text,varnish-upload} site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:53:43] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:53:43] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job={varnish-text,varnish-upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:54:23] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:54:35] <_joe_> that's the consequence, should be ok now
[07:56:32] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:56:32] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:56:52] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:57:12] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:57:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:57:52] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[08:02:35] Operations, Operations-Software-Development: python3-conftool needs python3-dns - https://phabricator.wikimedia.org/T209136 (ema)
[08:02:45] Operations, Operations-Software-Development: python3-conftool needs python3-dns - https://phabricator.wikimedia.org/T209136 (ema) p:Triage>Normal
[08:04:35] Operations, Operations-Software-Development: python3-conftool needs python3-dns - https://phabricator.wikimedia.org/T209136 (ema)
[08:05:47] !log repool cp2006, cp2012 (cache_text) T208588
[08:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:50] T208588: Add ex cp-misc_codfw to text and upload - https://phabricator.wikimedia.org/T208588
[08:06:00] _joe_: network blip?
[08:06:33] (CR) DCausse: elasticsearch: cookbook for multi-cluster services rolling restart (8 comments) [cookbooks] - https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: Mathew.onipe)
[08:06:39] (CR) DCausse: elasticsearch_cluster: multi-cluster/multi-instance support (5 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: Mathew.onipe)
[08:09:27] (PS3) Elukey: Add timer importing page-history dumps to hadoop [puppet] - https://gerrit.wikimedia.org/r/472472 (https://phabricator.wikimedia.org/T202489) (owner: Joal)
[08:09:36] (PS2) Muehlenhoff: Remove obsolete hiera file [puppet] - https://gerrit.wikimedia.org/r/472106
[08:10:45] <_joe_> ema: AFAICS yes
[08:11:03] <_joe_> ema: codfw lost communication with esams and eqiad, is what I'd bet on
[08:13:46] (CR) Muehlenhoff: [C: 2] Remove obsolete hiera file [puppet] - https://gerrit.wikimedia.org/r/472106 (owner: Muehlenhoff)
[08:28:02] !log installing nginx security updates
[08:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:42] RECOVERY - Host kubestage1001.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 0.81 ms
[08:43:15] (PS1) Vgutierrez: Release 0.6 [software/certcentral] - https://gerrit.wikimedia.org/r/472621 (https://phabricator.wikimedia.org/T208859)
[08:47:08] (CR) Vgutierrez: [C: 2] Release 0.6 [software/certcentral] - https://gerrit.wikimedia.org/r/472621 (https://phabricator.wikimedia.org/T208859) (owner: Vgutierrez)
[08:47:42] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:49:01] (CR) jenkins-bot: Release 0.6 [software/certcentral] - https://gerrit.wikimedia.org/r/472621 (https://phabricator.wikimedia.org/T208859) (owner: Vgutierrez)
[08:50:14] (PS1) Vgutierrez: acme_requests: log order URI on non-recoverable finalization errors
[software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472622 (https://phabricator.wikimedia.org/T208859)
[08:50:18] (PS1) Vgutierrez: certcentral: Evaluate order status after creation [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472623 (https://phabricator.wikimedia.org/T208948)
[08:50:26] (PS1) Vgutierrez: certcentral: Stop using acme.client.poll_and_finalize() [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472624 (https://phabricator.wikimedia.org/T208967)
[08:50:30] (PS1) Vgutierrez: Release 0.6 [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472625 (https://phabricator.wikimedia.org/T208859)
[08:51:09] (PS2) Ema: cache: add cp2018 and cp2025 to cache::upload [puppet] - https://gerrit.wikimedia.org/r/472150 (https://phabricator.wikimedia.org/T208588)
[08:52:17] (CR) Ema: [C: 2] cache: add cp2018 and cp2025 to cache::upload [puppet] - https://gerrit.wikimedia.org/r/472150 (https://phabricator.wikimedia.org/T208588) (owner: Ema)
[08:52:48] (CR) Vgutierrez: [C: 2] acme_requests: log order URI on non-recoverable finalization errors [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472622 (https://phabricator.wikimedia.org/T208859) (owner: Vgutierrez)
[08:53:00] (CR) Vgutierrez: [C: 2] certcentral: Evaluate order status after creation [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472623 (https://phabricator.wikimedia.org/T208948) (owner: Vgutierrez)
[08:53:09] (CR) Vgutierrez: [C: 2] certcentral: Stop using acme.client.poll_and_finalize() [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472624 (https://phabricator.wikimedia.org/T208967) (owner: Vgutierrez)
[08:53:25] (CR) Vgutierrez: [C: 2] Release 0.6 [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472625 (https://phabricator.wikimedia.org/T208859) (owner: Vgutierrez)
[08:54:44] (CR) jenkins-bot:
acme_requests: log order URI on non-recoverable finalization errors [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472622 (https://phabricator.wikimedia.org/T208859) (owner: Vgutierrez)
[08:54:54] (CR) jenkins-bot: certcentral: Evaluate order status after creation [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472623 (https://phabricator.wikimedia.org/T208948) (owner: Vgutierrez)
[08:55:05] (CR) jenkins-bot: certcentral: Stop using acme.client.poll_and_finalize() [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472624 (https://phabricator.wikimedia.org/T208967) (owner: Vgutierrez)
[08:55:15] (CR) jenkins-bot: Release 0.6 [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472625 (https://phabricator.wikimedia.org/T208859) (owner: Vgutierrez)
[08:57:10] (PS1) Vgutierrez: debian: Add release 0.6 to changelog [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472626 (https://phabricator.wikimedia.org/T208859)
[08:58:19] Operations, Traffic, Patch-For-Review: Add ex cp-misc_codfw to text and upload - https://phabricator.wikimedia.org/T208588 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ` ['cp2018.codfw.wmnet', 'cp2025.codfw.wmnet'] ` The log can be found in `/var/lo...
[08:59:16] (CR) Vgutierrez: [C: 2] debian: Add release 0.6 to changelog [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472626 (https://phabricator.wikimedia.org/T208859) (owner: Vgutierrez)
[09:00:53] (CR) Gehel: "one (hopefully last) very minor comment."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/471665 (https://phabricator.wikimedia.org/T208394) (owner: Mathew.onipe)
[09:01:11] (CR) jenkins-bot: debian: Add release 0.6 to changelog [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/472626 (https://phabricator.wikimedia.org/T208859) (owner: Vgutierrez)
[09:02:50] !log depooling db1106 (T208954)
[09:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:53] T208954: Missing row in enwiki.archive on sanitarium - https://phabricator.wikimedia.org/T208954
[09:04:14] (CR) Banyek: [C: 2] mariadb: depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/472405 (https://phabricator.wikimedia.org/T208954) (owner: Banyek)
[09:04:26] (CR) Filippo Giunchedi: "See inline" (6 comments) [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/471298 (https://phabricator.wikimedia.org/T208066) (owner: Cwhite)
[09:04:27] (CR) Banyek: [V: 2 C: 2] mariadb: depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/472405 (https://phabricator.wikimedia.org/T208954) (owner: Banyek)
[09:04:52] (CR) jenkins-bot: mariadb: depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/472405 (https://phabricator.wikimedia.org/T208954) (owner: Banyek)
[09:07:17] Operations, Traffic: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 (MoritzMuehlenhoff)
[09:08:01] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T189158: depool db1106 (duration: 00m 55s)
[09:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:04] T189158: Change `image` view to properly expose the new `img_description_id` field - https://phabricator.wikimedia.org/T189158
[09:09:21] Operations, monitoring, Patch-For-Review, User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (MoritzMuehlenhoff)
[09:10:34] Operations, Traffic, User-ArielGlenn: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 (ArielGlenn)
[09:12:11] (PS2) Ema: cache: turn on IPsec for cp2018 and cp2025 [puppet] - https://gerrit.wikimedia.org/r/472151 (https://phabricator.wikimedia.org/T208588)
[09:12:54] (CR) Ema: [C: 2] cache: turn on IPsec for cp2018 and cp2025 [puppet] - https://gerrit.wikimedia.org/r/472151 (https://phabricator.wikimedia.org/T208588) (owner: Ema)
[09:13:00] Operations, Traffic, Patch-For-Review: Add ex cp-misc_codfw to text and upload - https://phabricator.wikimedia.org/T208588 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2018.codfw.wmnet', 'cp2025.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2018.codfw.wmnet', 'cp2025.codfw.wmnet'] `
[09:13:22] (PS8) Mathew.onipe: wdqs: separation of concerns [puppet] - https://gerrit.wikimedia.org/r/471665 (https://phabricator.wikimedia.org/T208394)
[09:15:51] (PS9) Gehel: wdqs: separation of concerns [puppet] - https://gerrit.wikimedia.org/r/471665 (https://phabricator.wikimedia.org/T208394) (owner: Mathew.onipe)
[09:16:51] (CR) Mathew.onipe: wdqs: separation of concerns (1 comment) [puppet] - https://gerrit.wikimedia.org/r/471665 (https://phabricator.wikimedia.org/T208394) (owner: Mathew.onipe)
[09:16:55] (CR) Gehel: [C: 2] "PCC looks good" [puppet] - https://gerrit.wikimedia.org/r/471665 (https://phabricator.wikimedia.org/T208394) (owner: Mathew.onipe)
[09:21:02] !log stopping replication on db1106 (T208672)
[09:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:05] T208672: Duplicate rows error in db2095 replication @s7 - https://phabricator.wikimedia.org/T208672
[09:21:24] !log stopping replication on db1106 (T208954)
[09:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:27] T208954: Missing row in enwiki.archive on sanitarium -
https://phabricator.wikimedia.org/T208954
[09:24:28] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (Gehel) a:Gehel
[09:30:31] PROBLEM - Check systemd state on cp2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:30:59] that's me ^
[09:34:02] (PS1) Elukey: Relase new upstream version [debs/prometheus-mcrouter-exporter] (debian) - https://gerrit.wikimedia.org/r/472628 (https://phabricator.wikimedia.org/T208375)
[09:37:20] PROBLEM - HTTPS Unified ECDSA on cp2025 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2018-11-22 07:59:59 +0000 (expires in 12 days)
[09:37:34] ditto
[09:37:50] RECOVERY - Check systemd state on cp2018 is OK: OK - running: The system is fully operational
[09:44:30] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 50 probes of 326 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[09:45:50] !log truncating enwiki.archive on db1124 and labsdb hosts too (T208954)
[09:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:53] T208954: Missing row in enwiki.archive on sanitarium - https://phabricator.wikimedia.org/T208954
[09:48:05] !log repool cp2018, cp2025 (cache_upload) T208588
[09:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:07] T208588: Add ex cp-misc_codfw to text and upload - https://phabricator.wikimedia.org/T208588
[09:49:40] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 326 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[09:51:45] Operations, Traffic,
Patch-For-Review: Add ex cp-misc_codfw to text and upload - https://phabricator.wikimedia.org/T208588 (ema) Open>Resolved
[09:52:29] (PS1) Muehlenhoff: Remove Diamond from caches [puppet] - https://gerrit.wikimedia.org/r/472632 (https://phabricator.wikimedia.org/T183454)
[09:52:40] RECOVERY - Host kubestage1001.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 0.84 ms
[09:54:50] PROBLEM - HTTPS Unified RSA on cp2018 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2018-11-22 07:59:59 +0000 (expires in 12 days)
[09:55:10] PROBLEM - HTTPS Unified ECDSA on cp2018 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2018-11-22 07:59:59 +0000 (expires in 12 days)
[09:57:11] ACKNOWLEDGEMENT - HTTPS Unified ECDSA on cp2018 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2018-11-22 07:59:59 +0000 (expires in 12 days) Ema https://phabricator.wikimedia.org/T208603
[09:57:11] ACKNOWLEDGEMENT - HTTPS Unified RSA on cp2018 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2018-11-22 07:59:59 +0000 (expires in 12 days) Ema https://phabricator.wikimedia.org/T208603
[09:59:40] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:03:12] Operations, LDAP-Access-Requests, SRE-Access-Requests, Patch-For-Review, and 2 others: Onboarding Chris Danis (CDanis) - https://phabricator.wikimedia.org/T208729 (MoritzMuehlenhoff)
[10:03:14] Operations, SRE-Access-Requests, User-CDanis, User-herron: pwstore access for cdanis - https://phabricator.wikimedia.org/T209134 (MoritzMuehlenhoff) Open>Resolved I've added you to pwstore, please see https://office.wikimedia.org/wiki/Pwstore for some docs. If you run into any issues, pin...
[10:03:29] Operations, LDAP-Access-Requests, SRE-Access-Requests, Patch-For-Review, and 2 others: Onboarding Chris Danis (CDanis) - https://phabricator.wikimedia.org/T208729 (MoritzMuehlenhoff)
[10:05:00] RECOVERY - Host kubestage1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[10:06:19] Operations, ops-eqiad: Broken memory on mw1239 - https://phabricator.wikimedia.org/T209139 (MoritzMuehlenhoff)
[10:07:05] Operations, ops-eqiad: Broken memory on mw1239 - https://phabricator.wikimedia.org/T209139 (MoritzMuehlenhoff) Server is depooled for now
[10:07:30] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 67.01 ge 4 Muehlenhoff T209139 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad%2520prometheus%252Fops
[10:08:37] (PS2) GTirloni: Revert "wiki replicas: depool labsdb1009 for updates" [puppet] - https://gerrit.wikimedia.org/r/472530 (https://phabricator.wikimedia.org/T189158)
[10:12:00] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:12:23] Operations, ops-eqiad: Broken mgmt on kubestage1001 - https://phabricator.wikimedia.org/T209140 (MoritzMuehlenhoff)
[10:12:24] Operations, ops-eqiad: Broken mgmt on kubestage1001 - https://phabricator.wikimedia.org/T209140 (MoritzMuehlenhoff) p:Triage>Normal
[10:13:20] ACKNOWLEDGEMENT - SSH kubestage1001.mgmt on kubestage1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Muehlenhoff T209140
[10:50:54] !log uploaded certcentral 0.6 to apt.wikimedia.org (stretch) - T208859 T208948 T208967 T208970
[10:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:05] T208948: certcentral "wrongly" assumes that a new order always implies fulfilling new challenges - https://phabricator.wikimedia.org/T208948
[10:51:05] T208970: certcentral wrongly handles acme.errors.ValidationError exception -
https://phabricator.wikimedia.org/T208970
[10:51:05] T208967: Avoid using acme.client poll_and_finalize() method - https://phabricator.wikimedia.org/T208967
[10:51:06] T208859: certcentral: keep track of orders and authorizations IDs when issuing certificates - https://phabricator.wikimedia.org/T208859
[11:10:18] (Abandoned) Muehlenhoff: Move declaration of diamond package and config out of diamond class [puppet] - https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[11:12:10] RECOVERY - Host kubestage1001.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 0.77 ms
[11:19:10] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[11:21:21] (CR) Effie Mouzeli: [C: 2] role::eqiad::scb: switch rdb1003:6382 with rdb1005:6379 [puppet] - https://gerrit.wikimedia.org/r/472454 (https://phabricator.wikimedia.org/T206450) (owner: Effie Mouzeli)
[11:21:32] (PS2) Effie Mouzeli: role::eqiad::scb: switch rdb1003:6382 with rdb1005:6379 [puppet] - https://gerrit.wikimedia.org/r/472454 (https://phabricator.wikimedia.org/T206450)
[11:22:18] !log switch scb*.eqiad.wmnet nutcracker rdb1003:6382 with rdb1005:6379
[11:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:26] !log upgrade apertium apertium-cat apertium-fra apertium-fra-cat apertium-lex-tools apertium-separable cg3 libapertium3-3.5-1 libcg3-1 lttoolbox on scb1002
[11:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:44] !log upgrade apertium apertium-cat apertium-fra apertium-fra-cat apertium-lex-tools apertium-separable cg3 libapertium3-3.5-1 libcg3-1 lttoolbox on all scb boxes and restart apertium-apy
[11:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:47] kart_: ^
[11:36:04] noted akosiaris
[11:38:08] Operations, Patch-For-Review, SCB: Changeprop: Error during deduplication -
https://phabricator.wikimedia.org/T209064 (jijiki) Open>Resolved
[11:39:31] !log akosiaris@puppetmaster1001 conftool action : set/weight=8; selector: dc=eqiad,service=apertium,cluster=scb,name=scb1001.*
[11:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:30] !log set previous normal wait for scb1001 for apertium service T206439
[11:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:35] T206439: Package: lttoolbox, apertium, cg3, hfst and hfst-ospell - https://phabricator.wikimedia.org/T206439
[11:45:32] !log data load finished restarting replication on db1106 (T208954)
[11:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:35] T208954: Missing row in enwiki.archive on sanitarium - https://phabricator.wikimedia.org/T208954
[11:48:10] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 214.68 seconds
[11:54:55] akosiaris: Thanks a lot for all help in deployment.
[11:55:12] process question: does a one-off maintenance script to fix values of a field in a database need to be /merged/ (it needs to be reviewed, definitely) before it's run on mwmaint1002?
[11:55:30] (context: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/472596/)
[11:56:20] kart_: thanks as well
[11:57:22] phuedx: I would merge it tbh unless you have valid reasons for not wanting to.
You can always delete later
[12:03:45] (PS1) Matthias Geisler: Enable SSR termbox for wikibase on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143)
[12:04:57] (CR) jerkins-bot: [V: -1] Enable SSR termbox for wikibase on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143) (owner: Matthias Geisler)
[12:07:08] (CR) Addshore: [C: -1] Enable SSR termbox for wikibase on beta (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143) (owner: Matthias Geisler)
[12:07:33] (PS2) Matthias Geisler: Enable SSR termbox for wikibase on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143)
[12:12:32] akosiaris: thanks! would i scap pull on mwmaint or wait for a (swat) deploy?
[12:14:10] RECOVERY - Host kubestage1001.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 0.84 ms
[12:15:20] phuedx: whatever suits you best
[12:16:14] hmm what's up with kubestage1001 management?
[12:16:23] !log kartik@deploy1001 Started deploy [cxserver/deploy@fc21164]: Update cxserver to 01686f6 (T208831)
[12:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:26] T208831: Make Apertium tests independent of Labs service - https://phabricator.wikimedia.org/T208831
[12:17:32] !log kartik@deploy1001 Finished deploy [cxserver/deploy@fc21164]: Update cxserver to 01686f6 (T208831) (duration: 01m 09s)
[12:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:50] I am counting 99% packetloss from bast1002
[12:17:54] hmm
[12:18:16] akosiaris: filed a ticket earlier
[12:18:33] https://phabricator.wikimedia.org/T209112
[12:18:46] it's been flapping throughout the morning
[12:21:10] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:28:24] (PS3) Matthias Geisler: Enable SSR termbox for wikibase on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143)
[12:34:15] !log repooling db1106 (T208954)
[12:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:19] T208954: Missing row in enwiki.archive on sanitarium - https://phabricator.wikimedia.org/T208954
[12:34:53] (PS1) Banyek: Revert "mariadb: depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/472646
[12:35:05] Operations, netops: Investiagate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (ema)
[12:35:15] Operations, netops: Investiagate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (ema) p:Triage>Normal
[12:36:43] (CR) Volans: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/472646 (owner: Banyek)
[12:37:08] (CR) Banyek: [C: 2] Revert "mariadb: depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/472646 (owner: Banyek)
[12:37:31] RECOVERY - Host
kubestage1001.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 0.72 ms [12:37:52] (03CR) 10Banyek: [V: 032 C: 032] Revert "mariadb: depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472646 (owner: 10Banyek) [12:40:49] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T189158: repool db1106 (duration: 00m 53s) [12:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:52] T189158: Change `image` view to properly expose the new `img_description_id` field - https://phabricator.wikimedia.org/T189158 [12:42:12] (03CR) 10Addshore: [C: 04-1] Enable SSR termbox for wikibase on beta (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143) (owner: 10Matthias Geisler) [12:49:06] (03CR) 10jenkins-bot: Revert "mariadb: depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472646 (owner: 10Banyek) [12:50:27] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Looks nice to me, minor comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/471914 (owner: 10Giuseppe Lavagetto) [12:53:05] 10Operations, 10ops-codfw: Degraded RAID on heze-array1 - https://phabricator.wikimedia.org/T206909 (10akosiaris) @Papaul I 'd say ignore it. That system+disk self/array is scheduled for decomission, to be replaced with backup2001 (T196477). The data in it is a copy of the data from helium so we ain't gonna... 
[12:54:01] (03CR) 10Giuseppe Lavagetto: profile::base: allow nodes to page when down (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/471914 (owner: 10Giuseppe Lavagetto) [12:54:31] 10Operations, 10netops: Investiagate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (10elukey) [12:54:40] 10Operations, 10netops: Investiagate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (10elukey) [12:55:33] (03PS4) 10Matthias Geisler: Enable SSR termbox for wikibase on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143) [12:59:16] 10Operations, 10netops: Investigate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (10Aklapper) [13:04:05] (03PS2) 10Giuseppe Lavagetto: profile::base: allow nodes to page when down [puppet] - 10https://gerrit.wikimedia.org/r/471914 [13:07:20] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=DELETE https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:07:24] (03PS8) 10Pipix: RSS: Update URLs to the old Wikimedia Foundation blog to point to the new site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471260 (https://phabricator.wikimedia.org/T208458) [13:08:21] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:08:26] (03CR) 10Pipix: RSS: Update URLs to the old Wikimedia Foundation blog to point to the new site (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471260 (https://phabricator.wikimedia.org/T208458) (owner: 10Pipix) [13:10:02] !log upgrading qemu on ganeti2001 (packages supporting SSBD passthrough) [13:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:13] (03CR) 10Addshore: [C: 04-1] Enable SSR termbox for wikibase on beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143) (owner: 10Matthias Geisler) [13:18:43] 10Operations, 10ops-eqsin, 10Traffic: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10faidon) Why is this still pending? [13:19:19] (03CR) 10Andrew Bogott: [C: 031] profile::base: allow nodes to page when down [puppet] - 10https://gerrit.wikimedia.org/r/471914 (owner: 10Giuseppe Lavagetto) [13:21:31] !log upload graphite-web_1.0.2+debian-2.1wmf1 to stretch-wikimedia - T208782 [13:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:34] T208782: Graphite error causing breakage of Graphite-backed Grafana dashboards - https://phabricator.wikimedia.org/T208782 [13:22:20] (03PS5) 10Matthias Geisler: Enable SSR termbox for wikibase on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143) [13:23:29] (03CR) 10jerkins-bot: [V: 04-1] Enable SSR termbox for wikibase on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143) (owner: 10Matthias Geisler) [13:28:28] (03PS6) 10Matthias Geisler: Enable SSR termbox for wikibase on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143) [13:31:37] (03PS11) 10Mathew.onipe: elasticsearch: cookbook for 
multi-cluster services rolling restart [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) [13:32:52] !log rebooting acrab for some qemu tests [13:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:12] (03PS7) 10Matthias Geisler: Enable SSR termbox for wikibase on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143) [13:33:14] (03PS5) 10MacFan4000: Set wgNoticeProjects for wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471663 (https://phabricator.wikimedia.org/T208694) [13:37:07] 10Operations, 10Continuous-Integration-Infrastructure: Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10fgiunchedi) [13:37:36] (03PS3) 10Andrew Bogott: profile::base: allow nodes to page when down [puppet] - 10https://gerrit.wikimedia.org/r/471914 (owner: 10Giuseppe Lavagetto) [13:37:38] (03PS1) 10Andrew Bogott: openstack: remove explicit setting of contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/472652 (https://phabricator.wikimedia.org/T206224) [13:38:40] (03CR) 10jerkins-bot: [V: 04-1] openstack: remove explicit setting of contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/472652 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew Bogott) [13:39:33] (03PS2) 10Andrew Bogott: openstack: remove explicit setting of contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/472652 (https://phabricator.wikimedia.org/T206224) [13:43:33] (03PS1) 10Andrew Bogott: Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/472653 (https://phabricator.wikimedia.org/T204745) [13:44:47] (03CR) 10Andrew Bogott: [C: 032] Horizon: move projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/472653 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [13:46:41] (03CR) 10Filippo Giunchedi: [C: 031] Relase new upstream 
version [debs/prometheus-mcrouter-exporter] (debian) - 10https://gerrit.wikimedia.org/r/472628 (https://phabricator.wikimedia.org/T208375) (owner: 10Elukey) [14:14:33] jouncebot: now [14:14:33] No deployments scheduled for the next 68 hour(s) and 15 minute(s) [14:14:36] aaah yes [14:14:41] * addshore goes to merge something for beta [14:14:47] (03PS8) 10Addshore: Enable SSR termbox for wikibase on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143) (owner: 10Matthias Geisler) [14:14:59] (03CR) 10Addshore: [C: 032] Enable SSR termbox for wikibase on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143) (owner: 10Matthias Geisler) [14:16:32] (03Merged) 10jenkins-bot: Enable SSR termbox for wikibase on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143) (owner: 10Matthias Geisler) [14:18:44] (03PS4) 10Giuseppe Lavagetto: profile::base: allow nodes to page when down [puppet] - 10https://gerrit.wikimedia.org/r/471914 [14:18:49] !log addshore@deploy1001 Synchronized wmf-config: BETA ONLY: Enable SSR termbox for wikibase on beta - T209143 (duration: 00m 56s) [14:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:52] T209143: Enable termbox on beta - https://phabricator.wikimedia.org/T209143 [14:20:31] (03CR) 10jenkins-bot: Enable SSR termbox for wikibase on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472642 (https://phabricator.wikimedia.org/T209143) (owner: 10Matthias Geisler) [14:21:11] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::base: allow nodes to page when down [puppet] - 10https://gerrit.wikimedia.org/r/471914 (owner: 10Giuseppe Lavagetto) [14:21:56] (03PS3) 10Giuseppe Lavagetto: openstack: remove explicit setting of contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/472652 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew 
Bogott) [14:25:58] (03CR) 10Giuseppe Lavagetto: [C: 032] openstack: remove explicit setting of contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/472652 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew Bogott) [14:39:08] !log ladsgroup@deploy1001 Started deploy [ores/deploy@0728805]: T191842 T209060 [14:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:12] T191842: Deployment git server can't supply ORES hosts in parallel - https://phabricator.wikimedia.org/T191842 [14:39:13] T209060: ores returns error for draftquality models - https://phabricator.wikimedia.org/T209060 [14:39:20] akosiaris: ^ [14:40:06] Amir1: same procedure as last time ? git lfs pull everywhere ? [14:40:32] akosiaris: if needed, it should not (my patch should have fixed it) [14:41:46] ok lemme do it and see if that's true [14:42:30] yup it's true [14:42:36] I did not pull anything [14:42:41] \o/ [14:42:59] what was the issue after all ? [14:43:34] it was tracking basically all files as git lfs, even some .git files [14:43:45] like .gitignore [14:44:08] I guess that would make it to do git lfs pull twice [14:44:28] PROBLEM - puppet last run on cloudvirt1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:44:33] untracking non-binary files fixed the issue [14:44:50] <_joe_> andrewbogott: is it paging? [14:45:02] _joe_: not so far! [14:47:15] Amir1: lol. good catch! 
[14:47:44] :D [14:48:40] !log ladsgroup@deploy1001 deploy aborted: T191842 T209060 (duration: 09m 32s) [14:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:44] T191842: Deployment git server can't supply ORES hosts in parallel - https://phabricator.wikimedia.org/T191842 [14:48:44] T209060: ores returns error for draftquality models - https://phabricator.wikimedia.org/T209060 [14:48:52] Yup, I need to do something else as well [14:48:56] :/ [14:49:21] !log ladsgroup@deploy1001 Started deploy [ores/deploy@bb39f4b]: T191842 T209060, try II [14:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:28] RECOVERY - puppet last run on cloudvirt1024 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:50:56] !log rebooting cloudvirt1024 to (I hope) cause a page [14:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:39] Hey everyone, I'm about to cause a page on cloudvirt1024 on purpose to double-check some work _joe_ just did. [14:52:33] 10Operations, 10Patch-For-Review: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (10faidon) So the aforementioned functionality [[ https://github.com/digitalocean/netbox/issues/2367 | was removed ]] as obsolete due to NAPALM support replacing it and will not be part of the 2.5 r... 
[14:53:03] (03PS1) 10Filippo Giunchedi: mtail: more verbose test output on failure [puppet] - 10https://gerrit.wikimedia.org/r/472666 [14:53:05] (03PS1) 10Filippo Giunchedi: mtail: fix kernel.mtail compilation [puppet] - 10https://gerrit.wikimedia.org/r/472667 [14:53:54] PROBLEM - Host cloudvirt1024 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:00] (03CR) 10jerkins-bot: [V: 04-1] mtail: fix kernel.mtail compilation [puppet] - 10https://gerrit.wikimedia.org/r/472667 (owner: 10Filippo Giunchedi) [14:54:02] (03CR) 10jerkins-bot: [V: 04-1] mtail: more verbose test output on failure [puppet] - 10https://gerrit.wikimedia.org/r/472666 (owner: 10Filippo Giunchedi) [14:54:31] well, it worked [14:54:33] RECOVERY - Host cloudvirt1024 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [14:56:16] <_joe_> :) [14:56:29] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Lydia_Pintscher) @Addshore can you close this? [14:56:58] (03CR) 10Filippo Giunchedi: "Thanks for taking a look! I've published a couple of followup on the same topic: https://gerrit.wikimedia.org/r/q/topic:%22mtail-test%22+(" [puppet] - 10https://gerrit.wikimedia.org/r/472200 (owner: 10Bstorm) [14:58:07] (03Abandoned) 10Andrew Bogott: host monitoring: create new hiera global, host_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/464856 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew Bogott) [14:58:09] (03CR) 10Filippo Giunchedi: "Jenkins failure is expected, tests are still broken but now the output is more meaningful, which is the point of this CR." 
[puppet] - 10https://gerrit.wikimedia.org/r/472666 (owner: 10Filippo Giunchedi) [14:58:17] (03Abandoned) 10Andrew Bogott: labvirt/cloudvirt hosts: only page when a host is down [puppet] - 10https://gerrit.wikimedia.org/r/464857 (https://phabricator.wikimedia.org/T206224) (owner: 10Andrew Bogott) [14:59:25] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS: Fewer transitory middle-of-the-night puppet alerts - https://phabricator.wikimedia.org/T206224 (10Andrew) 05Open>03Resolved This should be fixed, thanks to Giuseppe's changes. [15:04:04] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@bb39f4b]: T191842 T209060, try II (duration: 14m 43s) [15:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:08] T191842: Deployment git server can't supply ORES hosts in parallel - https://phabricator.wikimedia.org/T191842 [15:04:09] T209060: ores returns error for draftquality models - https://phabricator.wikimedia.org/T209060 [15:04:14] (03PS2) 10Filippo Giunchedi: mtail: more verbose test output on failure [puppet] - 10https://gerrit.wikimedia.org/r/472666 [15:04:16] (03PS2) 10Filippo Giunchedi: mtail: fix kernel.mtail compilation [puppet] - 10https://gerrit.wikimedia.org/r/472667 [15:04:43] (03CR) 10jerkins-bot: [V: 04-1] mtail: more verbose test output on failure [puppet] - 10https://gerrit.wikimedia.org/r/472666 (owner: 10Filippo Giunchedi) [15:05:04] (03CR) 10jerkins-bot: [V: 04-1] mtail: fix kernel.mtail compilation [puppet] - 10https://gerrit.wikimedia.org/r/472667 (owner: 10Filippo Giunchedi) [15:06:35] !log cp1008/pinkunicorn: puppet disabled, public-facing testing of new globalsign 2018 certs [15:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:58] !log repooling labsdb1009 (T189158) [15:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:01] T189158: Change `image` view to properly expose the new `img_description_id` field - 
https://phabricator.wikimedia.org/T189158 [15:09:04] (03CR) 10Banyek: [C: 032] Revert "wiki replicas: depool labsdb1009 for updates" [puppet] - 10https://gerrit.wikimedia.org/r/472530 (https://phabricator.wikimedia.org/T189158) (owner: 10GTirloni) [15:09:05] PROBLEM - IPsec on rdb1007 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: rdb2005_v4 [15:09:26] (03PS3) 10Banyek: Revert "wiki replicas: depool labsdb1009 for updates" [puppet] - 10https://gerrit.wikimedia.org/r/472530 (https://phabricator.wikimedia.org/T189158) (owner: 10GTirloni) [15:09:31] (03CR) 10Banyek: [V: 032 C: 032] Revert "wiki replicas: depool labsdb1009 for updates" [puppet] - 10https://gerrit.wikimedia.org/r/472530 (https://phabricator.wikimedia.org/T189158) (owner: 10GTirloni) [15:09:33] (03PS3) 10Filippo Giunchedi: mtail: more verbose test output on failure [puppet] - 10https://gerrit.wikimedia.org/r/472666 [15:09:35] (03PS3) 10Filippo Giunchedi: mtail: fix kernel.mtail compilation [puppet] - 10https://gerrit.wikimedia.org/r/472667 [15:09:50] 10Operations, 10Release-Engineering-Team, 10Scap, 10Patch-For-Review, and 2 others: Deployment git server can't supply ORES hosts in parallel - https://phabricator.wikimedia.org/T191842 (10Ladsgroup) With the new number of 9 parallel connection and 14 minutes to deploy (down from around half an hour), I th... [15:10:15] (03CR) 10jerkins-bot: [V: 04-1] mtail: more verbose test output on failure [puppet] - 10https://gerrit.wikimedia.org/r/472666 (owner: 10Filippo Giunchedi) [15:10:37] (03CR) 10jerkins-bot: [V: 04-1] mtail: fix kernel.mtail compilation [puppet] - 10https://gerrit.wikimedia.org/r/472667 (owner: 10Filippo Giunchedi) [15:13:06] 10Operations, 10ops-codfw: Degraded RAID on heze-array1 - https://phabricator.wikimedia.org/T206909 (10Papaul) 05Open>03Resolved @akosiaris thanks. Resolving this task. 
[15:13:39] (03CR) 10Filippo Giunchedi: "Ok part of the problem is an older version of mtail 0.0+git20161231.ae129e9-1+b2 in the image as opposed to 3.0.0~rc5-1~bpo9+1 from stretc" [puppet] - 10https://gerrit.wikimedia.org/r/472666 (owner: 10Filippo Giunchedi) [15:16:19] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Papaul) a:05Papaul>03Banyek If this is in testing mode , I think it needs to be assigned to DB. [15:17:45] (03PS1) 10Effie Mouzeli: role::codfw::scb: switch rdb2003:6382 with rdb2005:6379 [puppet] - 10https://gerrit.wikimedia.org/r/472669 (https://phabricator.wikimedia.org/T206450) [15:17:48] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Banyek) Ah sorry we were talking about this off-phab: the stripe size will be good [15:19:33] (03PS1) 10GTirloni: wiki replicas: depool lasbdb1011 for changes [puppet] - 10https://gerrit.wikimedia.org/r/472670 (https://phabricator.wikimedia.org/T189158) [15:22:25] (03CR) 10Banyek: [C: 032] wiki replicas: depool lasbdb1011 for changes [puppet] - 10https://gerrit.wikimedia.org/r/472670 (https://phabricator.wikimedia.org/T189158) (owner: 10GTirloni) [15:23:26] !log depooling labsdb1011 [15:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:59] !log depooling labsdb1011 (T189158) [15:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:02] T189158: Change `image` view to properly expose the new `img_description_id` field - https://phabricator.wikimedia.org/T189158 [15:24:06] (03CR) 10Effie Mouzeli: "https://puppet-compiler.wmflabs.org/compiler1002/13424/scb2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/472669 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [15:24:28] (03CR) 10Effie Mouzeli: [C: 032] 
role::codfw::scb: switch rdb2003:6382 with rdb2005:6379 [puppet] - 10https://gerrit.wikimedia.org/r/472669 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [15:28:24] (03PS2) 10Effie Mouzeli: role::codfw::scb: switch rdb2003:6382 with rdb2005:6379 [puppet] - 10https://gerrit.wikimedia.org/r/472669 (https://phabricator.wikimedia.org/T206450) [15:36:56] 10Operations, 10Release Pipeline, 10Release-Engineering-Team: Design pipeline image versioning scheme - https://phabricator.wikimedia.org/T209088 (10thcipriani) p:05Triage>03Normal [15:37:41] anomie: hello! I hope you don't mind terribly but I've added you to the patchset for our prehistoric bug: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/472596/. Your review would be immensely appreciated. [15:38:35] niedzielski: I'll try to review it later today. [15:38:59] thank you so, so much anomie !! [15:44:05] (03PS1) 10Vgutierrez: acme_requests: Fix finalize_order() exception handling [software/certcentral] - 10https://gerrit.wikimedia.org/r/472676 (https://phabricator.wikimedia.org/T208967) [15:45:04] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:45:57] (03CR) 10Alex Monk: [C: 032] acme_requests: Fix finalize_order() exception handling [software/certcentral] - 10https://gerrit.wikimedia.org/r/472676 (https://phabricator.wikimedia.org/T208967) (owner: 10Vgutierrez) [15:47:10] 10Operations, 10ops-eqiad: kubestage1001.mgmt down or flapping - https://phabricator.wikimedia.org/T209112 (10fgiunchedi) [15:47:12] 10Operations, 10ops-eqiad: Broken mgmt on kubestage1001 - https://phabricator.wikimedia.org/T209140 (10fgiunchedi) [15:47:28] I'm silencing the kubestage1001.mgmt until chris is back [15:47:48] (03Merged) 10jenkins-bot: acme_requests: Fix finalize_order() exception handling [software/certcentral] - 10https://gerrit.wikimedia.org/r/472676 (https://phabricator.wikimedia.org/T208967) (owner: 10Vgutierrez) [15:49:35] (03CR) 10jenkins-bot: acme_requests: Fix 
finalize_order() exception handling [software/certcentral] - 10https://gerrit.wikimedia.org/r/472676 (https://phabricator.wikimedia.org/T208967) (owner: 10Vgutierrez) [15:56:42] do you know if/how it is possible to pin packages to backports in dockerfiles found in integration config? I want to use mtail from stretch-backports for puppet.git's docker image [15:56:54] context is https://gerrit.wikimedia.org/r/c/operations/puppet/+/472666 [16:07:04] RECOVERY - Host kubestage1001.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 337.74 ms [16:08:36] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471534 (https://phabricator.wikimedia.org/T208663) (owner: 10Zoranzoki21) [16:13:51] 10Operations, 10Traffic, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10RobH) [16:13:54] 10Operations, 10Traffic, 10Wikimedia-Incident: Add maint-announce@ to Equinix's recipient list for eqsin incidents - https://phabricator.wikimedia.org/T207140 (10RobH) 05Open>03stalled p:05High>03Low Vivian @ EQ Singapore fixed it, adding in maint announce to their alerts. We should get the next ale... [16:14:00] 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10MoritzMuehlenhoff) >>! In T... [16:21:00] (03CR) 10Filippo Giunchedi: [C: 031] Remove Diamond from caches [puppet] - 10https://gerrit.wikimedia.org/r/472632 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:27:10] 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10Krenair) >>! 
In T153468#473... [16:27:35] (03CR) 10Cwhite: [C: 031] Remove Diamond from caches [puppet] - 10https://gerrit.wikimedia.org/r/472632 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:29:48] (03CR) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [16:30:09] (03CR) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart (038 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe) [16:30:25] (03PS19) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) [16:35:19] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review, and 2 others: Onboarding Chris Danis (CDanis) - https://phabricator.wikimedia.org/T208729 (10CDanis) 05Open>03Resolved [16:36:24] thank you anomie for the great feedback!! 
[16:37:55] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Refactor puppet WDQS module - https://phabricator.wikimedia.org/T208201 (10debt) [16:37:59] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Separation of concerns for WDQS puppet module - https://phabricator.wikimedia.org/T208394 (10debt) 05Open>03Resolved [16:39:34] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Refactor puppet WDQS module - https://phabricator.wikimedia.org/T208201 (10debt) [16:39:39] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Fix Type constraints in wdqs (init.pp) - https://phabricator.wikimedia.org/T208393 (10debt) 05Open>03Resolved [16:39:43] (03PS1) 10Bstorm: sonofgridengine: correct parsing of the configs [puppet] - 10https://gerrit.wikimedia.org/r/472682 (https://phabricator.wikimedia.org/T200557) [16:47:06] (03CR) 10Bstorm: [C: 032] sonofgridengine: correct parsing of the configs [puppet] - 10https://gerrit.wikimedia.org/r/472682 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [17:02:02] 10Operations, 10ORES, 10Scoring-platform-team, 10Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249 (10Halfak) If we could handle twice the capacity, then we could allow researchers to query us twice as fast. :) We could lift our simultaneous... [17:04:54] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:08:04] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:13:12] (03CR) 10Legoktm: Add PHP version information to log entries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472498 (https://phabricator.wikimedia.org/T209076) (owner: 10Legoktm) [17:24:13] 10Operations, 10ORES, 10Scoring-platform-team, 10Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249 (10Ladsgroup) We can't increase number of our celery workers due to memory allocation. Maybe we need to tackle {T182350} then fix things and the... [17:40:22] (03PS1) 10Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472688 [17:40:24] (03PS1) 10Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472689 [17:40:26] (03PS1) 10Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472690 [17:40:28] (03PS1) 10Zoranzoki21: Add new throttle rule for Wikipedia event in Ireland on 2018-11-13, remove expired rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472691 (https://phabricator.wikimedia.org/T209037) [17:41:07] (03Abandoned) 10Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472688 (owner: 10Zoranzoki21) [17:41:10] (03Abandoned) 10Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472690 (owner: 10Zoranzoki21) [17:41:14] (03Abandoned) 10Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472689 (owner: 10Zoranzoki21) 
[17:42:25] (03PS2) 10Zoranzoki21: Add new throttle rule for Wikipedia event in Ireland on 2018-11-13, remove expired rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472691 (https://phabricator.wikimedia.org/T209037) [17:42:40] (03PS3) 10Zoranzoki21: Add new throttle rule for Wikipedia event in Ireland on 2018-11-13, remove expired rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472691 (https://phabricator.wikimedia.org/T209037) [17:50:35] (03PS1) 10Bstorm: sonofgridengine: Convert everything coming out of SGE to utf-8 [puppet] - 10https://gerrit.wikimedia.org/r/472692 (https://phabricator.wikimedia.org/T200557) [17:53:04] (03CR) 10Bstorm: [C: 032] sonofgridengine: Convert everything coming out of SGE to utf-8 [puppet] - 10https://gerrit.wikimedia.org/r/472692 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [17:55:23] 10Operations, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10Aklapper) Sigh. Cannot. Phab broken. (I logged in as `@Phabricator_maintenance`, I clicked "Move Tasks to Column..." in the dropdown of the "Backlog" column header on the Operations work... [18:05:45] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10bd808) > * OpenStack Horizon (dashboard) > * Wikimedia Striker (toolsadmin) Both of these services receive develope... 
[18:07:25] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 (10herron) [18:08:42] (03PS1) 10Herron: kafka_shipper: pin librdkafka1 to stretch-backports on stretch [puppet] - 10https://gerrit.wikimedia.org/r/472694 (https://phabricator.wikimedia.org/T206454) [18:11:52] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10bd808) > * DB replicas The Wiki Replica servers contain information considered sensitive by our privacy policies. T... [18:12:44] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10bd808) [18:15:24] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10Krenair) >>! In T207536#4735633, @bd808 wrote: >> * OpenStack Horizon (dashboard) >> * Wikimedia Striker (toolsadmin... 
[18:15:44] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10Krenair) [18:16:33] (03CR) 10Herron: "since profile::rsyslog::kafka_shipper is intended to be deployed to any host that sends logs to kafka (virtually any host) would there be " [puppet] - 10https://gerrit.wikimedia.org/r/472694 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [18:17:27] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10Krenair) [18:17:29] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10bd808) [18:18:30] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (10Krenair) [18:19:18] (03CR) 10Cwhite: "> (6 comments)" (031 comment) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/471298 (https://phabricator.wikimedia.org/T208066) (owner: 10Cwhite) [18:20:19] 10Operations, 10Continuous-Integration-Infrastructure: Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10colewhite) p:05Triage>03Low [18:21:27] 10Operations, 10ops-eqiad, 10DC-Ops: kubestage1001.mgmt down or flapping - https://phabricator.wikimedia.org/T209112 (10colewhite) p:05Triage>03Normal [18:22:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10colewhite) p:05Triage>03Normal [18:26:25] (03PS1) 10GTirloni: Revert "wiki replicas: depool lasbdb1011 for changes" [puppet] - 
10https://gerrit.wikimedia.org/r/472696 (https://phabricator.wikimedia.org/T189158) [18:47:03] 10Operations: Add Mukunda to releasers-mediawiki - https://phabricator.wikimedia.org/T209176 (10greg) p:05Triage>03High [18:55:41] 10Operations, 10ORES, 10Scoring-platform-team, 10Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249 (10Halfak) It looks like we do have a bit more ceiling for memory usage. I half-remembered us tuning our worker-count down due to issues with c... [19:03:14] PROBLEM - IPsec on rdb1003 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: rdb2003_v4 [19:03:34] 10Operations, 10SRE-Access-Requests: Add Mukunda to releasers-mediawiki - https://phabricator.wikimedia.org/T209176 (10jijiki) [19:08:06] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10jijiki) [19:08:47] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10jijiki) [19:17:24] RECOVERY - SSH kubestage1001.mgmt on kubestage1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) [19:28:24] 10Operations, 10Goal, 10Patch-For-Review, 10Technical-Debt, and 2 others: Reduce technical debt in metrics monitoring - https://phabricator.wikimedia.org/T177195 (10CDanis) [19:31:46] 10Operations, 10monitoring, 10Patch-For-Review, 10User-CDanis, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10CDanis) [19:32:03] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), and 4 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10CDanis) [19:32:29] oh wow, adding your personal projects to tags makes a lot of noise [19:35:36] cdanis yup, wikibugs monitors any changes (except if you move it across the boards) [19:35:54] 
RECOVERY - IPsec on rdb1003 is OK: Strongswan OK - 1 ESP OK [19:39:51] !log repooling labsdb1011 (T189158) [19:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:54] T189158: Change `image` view to properly expose the new `img_description_id` field - https://phabricator.wikimedia.org/T189158 [19:39:56] (03CR) 10Banyek: [C: 032] Revert "wiki replicas: depool lasbdb1011 for changes" [puppet] - 10https://gerrit.wikimedia.org/r/472696 (https://phabricator.wikimedia.org/T189158) (owner: 10GTirloni) [19:42:50] (03PS1) 1020after4: Add my pgp key to mediawiki.org/keys/keys.(txt|html) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472711 (https://phabricator.wikimedia.org/T209105) [19:46:38] (03PS1) 10Legoktm: keys: Add Mukunda Modell [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472712 [19:47:25] (03Abandoned) 10Legoktm: keys: Add Mukunda Modell [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472712 (owner: 10Legoktm) [19:48:07] (03CR) 10Legoktm: [C: 04-1] "On the HTML version, your key should go above the people who are former, so right before Chris and after Brian." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/472711 (https://phabricator.wikimedia.org/T209105) (owner: 1020after4) [19:48:34] (03PS1) 10Dzahn: icinga/planet: add plugin to check planet content updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 [19:49:34] (03CR) 10jerkins-bot: [V: 04-1] icinga/planet: add plugin to check planet content updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (owner: 10Dzahn) [19:51:15] (03PS1) 10Effie Mouzeli: Reimage rdb2003/rdb2004 [puppet] - 10https://gerrit.wikimedia.org/r/472714 (https://phabricator.wikimedia.org/T206450) [20:02:30] (03PS3) 10Rush: AlarmCounterLogster: move matching to regex and yaml config [puppet] - 10https://gerrit.wikimedia.org/r/472597 (https://phabricator.wikimedia.org/T208611) [20:04:10] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Maintenance-scripts, and 3 others: cronspam cleanup: Cron /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T150375 (10jijiki) @Joe, we are trying to fix this wit... [20:05:32] (03CR) 10Effie Mouzeli: "Let's coordinate with Giuseppe on the relevant Phab task" [puppet] - 10https://gerrit.wikimedia.org/r/470877 (https://phabricator.wikimedia.org/T150375) (owner: 10Thifranc) [20:05:44] (03CR) 10jerkins-bot: [V: 04-1] AlarmCounterLogster: move matching to regex and yaml config [puppet] - 10https://gerrit.wikimedia.org/r/472597 (https://phabricator.wikimedia.org/T208611) (owner: 10Rush) [20:05:56] (03PS4) 10Rush: AlarmCounterLogster: move matching to regex and yaml config [puppet] - 10https://gerrit.wikimedia.org/r/472597 (https://phabricator.wikimedia.org/T208611) [20:06:55] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 (10faidon) Can this task be resolved, given we have T178592 to track the bast4001 decom? 
[20:08:38] (03CR) 10Rush: "I would like to let this run over the weekend w/ the more contextual alerting and I don't want to leave puppet disabled so I'm going to me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/472597 (https://phabricator.wikimedia.org/T208611) (owner: 10Rush) [20:08:43] (03CR) 10Rush: [C: 032] AlarmCounterLogster: move matching to regex and yaml config [puppet] - 10https://gerrit.wikimedia.org/r/472597 (https://phabricator.wikimedia.org/T208611) (owner: 10Rush) [20:08:50] !log restarted neutron-linuxbridge-agent on cloudvirt1018 and cloudvirt1023 [20:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:30] 10Operations, 10ChangeProp, 10Core Platform Team Backlog (Watching / External), 10SCB, 10Services (watching): Changeprop: Error during deduplication - https://phabricator.wikimedia.org/T209064 (10mobrovac) [20:11:08] 10Operations, 10ChangeProp, 10Core Platform Team Backlog (Watching / External), 10SCB, 10Services (watching): Changeprop: Error during deduplication - https://phabricator.wikimedia.org/T209064 (10mobrovac) Thank you @jijiki for the investigation and quick fix! 
[20:14:07] (03CR) 10Anomie: Add PHP version information to log entries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472498 (https://phabricator.wikimedia.org/T209076) (owner: 10Legoktm) [20:21:55] (03PS2) 10Effie Mouzeli: Reimage rdb2003/rdb2004, switch rdb100[123478] to spare::system [puppet] - 10https://gerrit.wikimedia.org/r/472714 (https://phabricator.wikimedia.org/T206450) [20:25:15] 10Operations, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685 (10jijiki) 05Open>03Resolved [20:25:18] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10jijiki) [20:25:37] 10Operations, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685 (10jijiki) Both rdb1009 and rdb1010 are in production. [20:29:15] (03PS2) 1020after4: Add my pgp key to mediawiki.org/keys/keys.(txt|html) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472711 (https://phabricator.wikimedia.org/T209105) [20:30:42] (03CR) 1020after4: "> On the HTML version, your key should go above the people who are" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472711 (https://phabricator.wikimedia.org/T209105) (owner: 1020after4) [20:31:21] 10Operations, 10SRE-Access-Requests: Add Mukunda to releasers-mediawiki - https://phabricator.wikimedia.org/T209176 (10mmodell) [20:33:36] (03CR) 10BryanDavis: Add PHP version information to log entries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472498 (https://phabricator.wikimedia.org/T209076) (owner: 10Legoktm) [20:35:04] (03CR) 10BryanDavis: [C: 031] Add PHP version information to log entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472498 (https://phabricator.wikimedia.org/T209076) (owner: 10Legoktm) [20:36:41] (03PS3) 10Legoktm: Add my (20after4) PGP key to 
mediawiki.org/keys/keys.(txt|html) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472711 (https://phabricator.wikimedia.org/T209105) (owner: 1020after4) [20:36:49] (03CR) 10Legoktm: [C: 032] Add my (20after4) PGP key to mediawiki.org/keys/keys.(txt|html) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472711 (https://phabricator.wikimedia.org/T209105) (owner: 1020after4) [20:38:11] (03Merged) 10jenkins-bot: Add my (20after4) PGP key to mediawiki.org/keys/keys.(txt|html) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472711 (https://phabricator.wikimedia.org/T209105) (owner: 1020after4) [20:38:42] legoktm: shall I deploy ^ [20:38:55] or does it fall under the no-deploy-friday rule [20:38:57] already am :) [20:39:00] ah ok [20:39:15] (03CR) 10jenkins-bot: Add my (20after4) PGP key to mediawiki.org/keys/keys.(txt|html) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472711 (https://phabricator.wikimedia.org/T209105) (owner: 1020after4) [20:40:11] !log legoktm@deploy1001 Synchronized docroot/mediawiki/keys/: Add my (20after4) PGP key to mediawiki.org/keys/keys.(txt|html) (duration: 00m 55s) [20:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:58] (03PS1) 10Bstorm: sonofgridengine: Add checkpoints [puppet] - 10https://gerrit.wikimedia.org/r/472719 (https://phabricator.wikimedia.org/T200557) [20:42:49] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: Add checkpoints [puppet] - 10https://gerrit.wikimedia.org/r/472719 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [20:44:24] 10Operations, 10User-Joe, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10jijiki) p:05Triage>03Normal [20:44:50] (03PS3) 10Effie Mouzeli: Reimage rdb2003/rdb2004, switch rdb100[123478] to spare::system [puppet] - 10https://gerrit.wikimedia.org/r/472714 (https://phabricator.wikimedia.org/T206450) [20:45:32] 10Operations, 10Patch-For-Review, 
10User-Joe, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10jijiki) [20:45:34] 10Operations: netbox won't allow me to upload photos of the rack - https://phabricator.wikimedia.org/T209182 (10RobH) p:05Triage>03Normal [20:45:35] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 608.09 seconds [20:46:11] 10Operations: netbox won't allow me to upload photos of the rack - https://phabricator.wikimedia.org/T209182 (10RobH) Not an emergency, so I did NOT go livehacking any permissions changes. [20:46:29] !log depooled wdqs1004 to let it catch up [20:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:50] (03PS2) 10Bstorm: sonofgridengine: Add checkpoints [puppet] - 10https://gerrit.wikimedia.org/r/472719 (https://phabricator.wikimedia.org/T200557) [20:47:36] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: Add checkpoints [puppet] - 10https://gerrit.wikimedia.org/r/472719 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [20:47:49] (03PS1) 10Herron: wmcs: cut over to new smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/472720 (https://phabricator.wikimedia.org/T41785) [20:48:58] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10herron) Now that it looks like eqiad1-r migrations are under way I think we'll want to do the cut-over sooner rather than later. [20:50:08] (03PS4) 10Effie Mouzeli: Reimage rdb2003/rdb2004, switch rdb100[123478] to spare::system [puppet] - 10https://gerrit.wikimedia.org/r/472714 (https://phabricator.wikimedia.org/T206450) [20:52:40] (03CR) 10Dzahn: [C: 031] "yea, if all those eqiad servers are done and out of production then they should move to spare. 
and if 2003/2004 should become a misc::mast" [puppet] - 10https://gerrit.wikimedia.org/r/472714 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [20:52:45] 10Operations, 10decommission, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10jijiki) [20:53:18] (03CR) 10Effie Mouzeli: [C: 032] Reimage rdb2003/rdb2004, switch rdb100[123478] to spare::system [puppet] - 10https://gerrit.wikimedia.org/r/472714 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [20:53:33] (03PS5) 10Effie Mouzeli: Reimage rdb2003/rdb2004, switch rdb100[123478] to spare::system [puppet] - 10https://gerrit.wikimedia.org/r/472714 (https://phabricator.wikimedia.org/T206450) [20:56:11] 10Operations: netbox won't allow me to upload photos of the rack - https://phabricator.wikimedia.org/T209182 (10Volans) @RobH yep, known issue, the immediate fix was already scheduled in https://gerrit.wikimedia.org/r/c/operations/puppet/+/463820 but then we decided to go directly in the direction of using swift... [20:57:50] (03CR) 10Volans: [C: 04-1] "Please no more python2, as we decided in the last offsite, see https://phabricator.wikimedia.org/T197804 for context ;)" [puppet] - 10https://gerrit.wikimedia.org/r/472713 (owner: 10Dzahn) [20:59:45] (03PS3) 10Bstorm: sonofgridengine: Add checkpoints [puppet] - 10https://gerrit.wikimedia.org/r/472719 (https://phabricator.wikimedia.org/T200557) [20:59:52] volans|off: since I know you are here: [20:59:59] no more python? or no more python2? 
[21:00:35] python2 ofc :) we all love python ;) [21:00:46] mutante: ^^ [21:01:00] probably minimal changes for python 3 compat [21:01:07] (03PS4) 10Bstorm: sonofgridengine: Add checkpoints [puppet] - 10https://gerrit.wikimedia.org/r/472719 (https://phabricator.wikimedia.org/T200557) [21:01:31] 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10Krenair) >>! In T153468#469... [21:01:46] python2 will stop being supported at the end of 2019, we need to slowly migrate existing code, but at least not adding any more py2 unless strictly required by some very specific dependency we cannot avoid [21:02:07] and also in that case we need to already have a plan to find an alternative solution [21:02:24] I suspect the script under discussion will need minimal changes to convert, but let's let him get it done as py2 and then convert before the merge [21:02:44] and I have it in my head to start work on all my stuff soon [21:02:53] like this quarter... 
it's going to not be prety [21:02:56] *pretty [21:03:15] not that urgent, just keep that in mind going forward [21:03:31] (03CR) 10Bstorm: [C: 032] sonofgridengine: Add checkpoints [puppet] - 10https://gerrit.wikimedia.org/r/472719 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [21:03:39] starting with not adding tech-debt is already great ;) [21:03:41] (03PS1) 10Effie Mouzeli: Reimage rdb2003/rdb2004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472729 (https://phabricator.wikimedia.org/T206450) [21:04:05] yep [21:04:18] I didn't know about that decision from the offsite, so it's good to hear [21:04:40] I just know that my dumps stuff has a lot of moving parts and if I want to not be rushing I gotta get on it soon [21:06:00] you can add to the wall https://pythonclock.org/ :-P [21:06:20] no thanks, just what I need is another stress indicator :-D [21:06:27] ahahaha [21:06:45] btw I was planning to do an ops session on py2->3 migration, as soon as I find the time to prepare it a bit [21:06:48] but it will be nice to use some of the v3 modules, they are so much nicer [21:06:57] also quicker ;) [21:07:01] I think wasn't that in the list of ops presentation proposals? 
[21:07:03] (03CR) 10Alex Monk: [C: 031] "let's give it a whirl" [puppet] - 10https://gerrit.wikimedia.org/r/472720 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [21:07:06] and I +1'ed it [21:07:08] yep is there [21:07:10] yeah [21:08:40] (03CR) 10Dzahn: [C: 031] Reimage rdb2003/rdb2004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472729 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [21:09:55] (03CR) 10Effie Mouzeli: [C: 032] Reimage rdb2003/rdb2004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472729 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [21:10:06] (03PS2) 10Effie Mouzeli: Reimage rdb2003/rdb2004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472729 (https://phabricator.wikimedia.org/T206450) [21:13:36] 10Operations, 10OTRS: Upgrade to OTRS version 5.0.30 - https://phabricator.wikimedia.org/T209184 (10akosiaris) 05Open>03Resolved Upgrade completed successfully. Also checked with a SELECT * FROM version of the 2 sql statements displayed in https://community.otrs.com/security-advisory-2018-09-security-updat... [21:15:24] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['rdb2003.codfw.wmnet'] ` The log can be found in `/var/... 
[21:15:47] (03CR) 10Alex Monk: [C: 031] "(I should probably note that mx-out01.wmflabs.org has been successfully getting a decent bit of mail out to me from shinken)" [puppet] - 10https://gerrit.wikimedia.org/r/472720 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [21:16:24] !log Reimaging rdb2003, rdb2004 - T206450 [21:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:27] T206450: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 [21:16:58] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['rdb2004.codfw.wmnet'] ` The log can be found in `/var/... [21:30:53] 10Operations, 10Mail, 10User-herron: Mail relays needed for VMs in eqiad1 - https://phabricator.wikimedia.org/T205158 (10herron) [21:30:55] (03PS2) 10Dzahn: icinga/planet: add plugin to check planet content updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 [21:30:56] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10herron) [21:31:43] (03CR) 10jerkins-bot: [V: 04-1] icinga/planet: add plugin to check planet content updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (owner: 10Dzahn) [21:33:20] (03PS2) 10Andrew Bogott: wmcs: cut over to new smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/472720 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [21:34:10] 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10Krenair) So that seems to w... 
[21:37:41] 10Puppet, 10cloud-services-team, 10Proposal: Revisit and update python testing in puppet - https://phabricator.wikimedia.org/T209189 (10Bstorm) [21:38:02] 10Puppet, 10Proposal, 10cloud-services-team (Kanban): Revisit and update python testing in puppet - https://phabricator.wikimedia.org/T209189 (10Bstorm) [21:38:22] 10Operations: Migrate tests from nose to pytest - https://phabricator.wikimedia.org/T208783 (10Bstorm) [21:38:25] 10Puppet, 10Proposal, 10cloud-services-team (Kanban): Revisit and update python testing in puppet - https://phabricator.wikimedia.org/T209189 (10Bstorm) [21:40:48] (03PS2) 10Legoktm: Add PHP version information to log entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472498 (https://phabricator.wikimedia.org/T209076) [21:41:13] (03CR) 10Legoktm: Add PHP version information to log entries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472498 (https://phabricator.wikimedia.org/T209076) (owner: 10Legoktm) [21:42:56] 10Operations, 10DNS, 10Mail, 10User-herron: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065 (10herron) 05stalled>03Resolved a:03herron Well, looks like no DKIM. [21:45:36] (03CR) 10Andrew Bogott: "We're going to merge this on Tuesday when people will be around to watch it go." 
[puppet] - 10https://gerrit.wikimedia.org/r/472720 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [21:46:39] !log repooled wdqs1004 - looks like other servers feel worse so probably makes sense to share the load equally [21:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:09] (03PS1) 10Mobrovac: Proton: Enable cancellable promises [puppet] - 10https://gerrit.wikimedia.org/r/472735 (https://phabricator.wikimedia.org/T204055) [21:47:43] (03CR) 10jerkins-bot: [V: 04-1] Proton: Enable cancellable promises [puppet] - 10https://gerrit.wikimedia.org/r/472735 (https://phabricator.wikimedia.org/T204055) (owner: 10Mobrovac) [21:49:09] (03PS2) 10Mobrovac: Proton: Enable cancellable promises [puppet] - 10https://gerrit.wikimedia.org/r/472735 (https://phabricator.wikimedia.org/T204055) [21:50:46] 10Operations, 10Wikimedia-Logstash, 10User-herron: Ship MX logs to ELK - https://phabricator.wikimedia.org/T197173 (10herron) [21:51:19] (03CR) 10Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/compiler1002/13425/" [puppet] - 10https://gerrit.wikimedia.org/r/472735 (https://phabricator.wikimedia.org/T204055) (owner: 10Mobrovac) [21:52:14] (03PS3) 10Andrew Bogott: role::labs::instance: Remove $::virtual == 'kvm' check for promethium [puppet] - 10https://gerrit.wikimedia.org/r/470100 (owner: 10Alex Monk) [21:54:00] (03CR) 10Andrew Bogott: [C: 032] role::labs::instance: Remove $::virtual == 'kvm' check for promethium [puppet] - 10https://gerrit.wikimedia.org/r/470100 (owner: 10Alex Monk) [21:58:40] PROBLEM - Check size of conntrack table on rdb2004 is CRITICAL: Return code of 255 is out of bounds [21:58:59] (03PS1) 10Bstorm: sonofgridengine: fix name conflict for checkpoint [puppet] - 10https://gerrit.wikimedia.org/r/472737 (https://phabricator.wikimedia.org/T200557) [21:59:31] 10Operations, 10Mail: Create affcom-staff email account - https://phabricator.wikimedia.org/T176153 (10herron) Was this figured out? 
[22:00:30] PROBLEM - Check systemd state on rdb2004 is CRITICAL: Return code of 255 is out of bounds [22:00:30] PROBLEM - configured eth on rdb2004 is CRITICAL: Return code of 255 is out of bounds [22:02:20] PROBLEM - Check the NTP synchronisation status of timesyncd on rdb2004 is CRITICAL: Return code of 255 is out of bounds [22:02:20] PROBLEM - dhclient process on rdb2004 is CRITICAL: Return code of 255 is out of bounds [22:02:40] I am working on this host [22:03:49] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: hotmail users are reporting mailman issues - https://phabricator.wikimedia.org/T127008 (10herron) 05Open>03Resolved a:03herron This sounds DMARC related. Please see https://wikitech.wikimedia.org/wiki/Mailman#DMARC_Compatibility Specifically: ` List a... [22:04:32] (03CR) 10Bstorm: [C: 032] sonofgridengine: fix name conflict for checkpoint [puppet] - 10https://gerrit.wikimedia.org/r/472737 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [22:07:55] (03PS1) 10Bstorm: sonofgridengine: add the missing ckpt template [puppet] - 10https://gerrit.wikimedia.org/r/472740 (https://phabricator.wikimedia.org/T200557) [22:08:00] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 219.89 seconds [22:12:31] RECOVERY - dhclient process on rdb2004 is OK: PROCS OK: 0 processes with command name dhclient [22:12:50] RECOVERY - Check systemd state on rdb2004 is OK: OK - running: The system is fully operational [22:12:50] RECOVERY - configured eth on rdb2004 is OK: OK - interfaces up [22:12:51] RECOVERY - Check size of conntrack table on rdb2004 is OK: OK: nf_conntrack is 0 % full [22:19:06] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb2004.codfw.wmnet'] ` and were **ALL** successful. 
[22:19:27] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb2003.codfw.wmnet'] ` and were **ALL** successful. [22:23:15] (03CR) 10Bstorm: [C: 032] sonofgridengine: add the missing ckpt template [puppet] - 10https://gerrit.wikimedia.org/r/472740 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [22:25:35] (03PS1) 10Bstorm: sonofgridengine: correct typo in template [puppet] - 10https://gerrit.wikimedia.org/r/472743 (https://phabricator.wikimedia.org/T200557) [22:27:20] (03PS1) 10Zoranzoki21: Enable autopatroller, patroller and rollbacker rights on srwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472744 [22:27:43] (03CR) 10Bstorm: [C: 032] sonofgridengine: correct typo in template [puppet] - 10https://gerrit.wikimedia.org/r/472743 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [22:32:19] RECOVERY - Check the NTP synchronisation status of timesyncd on rdb2004 is OK: OK: synced at Fri 2018-11-09 22:32:12 UTC. 
[22:38:01] (03PS1) 10Zoranzoki21: Disable FlaggedRevs on srwikinews and enable standard option for patrolling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472745 [22:53:59] PROBLEM - Check health of redis instance on 6380 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 625 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3240 keys, up 30 days 8 hours - replication_delay is 625 [22:54:10] PROBLEM - Check health of redis instance on 6381 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 635 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3463 keys, up 30 days 8 hours - replication_delay is 635 [22:54:29] PROBLEM - Check health of redis instance on 6378 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 651 600 - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 15 keys, up 30 days 8 hours - replication_delay is 651 [22:54:29] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 657 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705935 keys, up 30 days 8 hours - replication_delay is 657 [22:58:58] (03CR) 10Jforrester: [C: 04-2] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472605 (owner: 10Jforrester) [23:07:32] (03PS3) 10Dzahn: icinga/planet: add plugin to check planet content updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 [23:08:25] (03CR) 10jerkins-bot: [V: 04-1] icinga/planet: add plugin to check planet content updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (owner: 10Dzahn) [23:17:01] (03PS4) 10Dzahn: icinga/planet: add plugin to check planet content updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) [23:17:21] ACKNOWLEDGEMENT - Check health of redis instance on 6378 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 2003 600 - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 15 keys, up 30 days 8 hours - 
replication_delay is 2003 daniel_zahn https://phabricator.wikimedia.org/T206450 [23:48:00] (03PS5) 10Dzahn: icinga/planet: add plugin to check planet content updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) [23:48:49] (03CR) 10jerkins-bot: [V: 04-1] icinga/planet: add plugin to check planet content updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) (owner: 10Dzahn) [23:49:30] (03CR) 10Dzahn: "ok, it's Python 3 now" [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) (owner: 10Dzahn) [23:51:30] (03PS6) 10Dzahn: icinga/planet: add plugin to check planet content updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208)