[00:00:04] twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190314T0000).
[00:40:47] (CR) BryanDavis: [C: +1] puppet_alert: Email projectadmins instead of members [puppet] - https://gerrit.wikimedia.org/r/495757 (https://phabricator.wikimedia.org/T218009) (owner: Alex Monk)
[00:43:39] Operations, Core Platform Team Kanban (Blocked Externally), Services (blocked): Set warning thresholds for average cluster utilization - https://phabricator.wikimedia.org/T76306 (mobrovac) Open→Invalid This task is obsolete given the move to #kubernetes.
[00:47:20] PROBLEM - HTTP availability for Varnish at ulsfo on icinga2001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[00:47:51] Operations, Cassandra, Core Platform Team Kanban (Blocked Externally), Patch-For-Review, and 2 others: Setup automated topk wide row reporting - https://phabricator.wikimedia.org/T147366 (mobrovac) Open→Declined We don't need this any more, so declining.
[00:48:16] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:49:36] Operations, Cassandra, Core Platform Team Kanban (Blocked Externally), Services (blocked), User-Eevans: puppetize turning off reserved space for cassandra /srv - https://phabricator.wikimedia.org/T132632 (mobrovac) @Eevans @fgiunchedi should we go ahead with this?
[00:52:06] RECOVERY - HTTP availability for Varnish at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[00:53:04] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:55:52] Operations, Electron-PDFs, Core Platform Team Backlog (Attic), Services (attic): electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916 (mobrovac)
[01:15:34] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:17:40] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:18:08] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:18:50] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:19:18] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:20:18] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:24:52] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:27:16] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:36:36] PROBLEM - Host ms-be2037 is DOWN: PING CRITICAL - Packet loss = 100%
[02:14:32] (PS4) KartikMistry: Enable ExternalGuidance to all Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/493672 (https://phabricator.wikimedia.org/T216129)
[02:41:35] (PS3) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[02:44:48] (PS4) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[02:45:47] (CR) jerkins-bot: [V: -1] openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722) (owner: Andrew Bogott)
[02:46:25] (PS5) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[02:46:57] (CR) jerkins-bot: [V: -1] openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722) (owner: Andrew Bogott)
[02:48:00] (PS6) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[02:48:37] (CR) jerkins-bot: [V: -1] openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722) (owner: Andrew Bogott)
[02:52:13] (PS7) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[02:53:08] (CR) jerkins-bot: [V: -1] openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722) (owner: Andrew Bogott)
[03:02:49] (PS8) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[03:04:31] (PS9) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[03:08:26] (PS1) Andrew Bogott: ldap: added a couple more dummy passwords [labs/private] - https://gerrit.wikimedia.org/r/496365
[03:08:47] (CR) Andrew Bogott: [V: +2 C: +2] ldap: added a couple more dummy passwords [labs/private] - https://gerrit.wikimedia.org/r/496365 (owner: Andrew Bogott)
[03:15:51] (PS10) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[03:19:24] (PS1) Andrew Bogott: Added a dummy password for compiler testing [labs/private] - https://gerrit.wikimedia.org/r/496366
[03:20:40] (CR) Andrew Bogott: [V: +2 C: +2] Added a dummy password for compiler testing [labs/private] - https://gerrit.wikimedia.org/r/496366 (owner: Andrew Bogott)
[03:35:27] (PS11) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[03:36:25] (CR) jerkins-bot: [V: -1] openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722) (owner: Andrew Bogott)
[03:37:58] (PS12) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[03:39:43] (PS13) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[03:41:28] (PS14) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[03:49:14] (PS15) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[03:58:42] (PS16) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[04:00:04] kart_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190314T0400).
[04:00:26] !log Started manual run of unpublished ContentTranslation draft purge script (T217818)
[04:00:44] (PS1) Andrew Bogott: openldap: a couple more dummy passwords [labs/private] - https://gerrit.wikimedia.org/r/496368
[04:00:59] (CR) Andrew Bogott: [V: +2 C: +2] openldap: a couple more dummy passwords [labs/private] - https://gerrit.wikimedia.org/r/496368 (owner: Andrew Bogott)
[04:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:01:07] T217818: Run unpublished draft purge script for CX (Week of 03/10) - https://phabricator.wikimedia.org/T217818
[04:03:24] (PS17) Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722)
[04:06:27] (CR) Andrew Bogott: "17 patchsets in, I'm finally happy with the diff. Pretty much a no-op on serpens and seaborgium, and adds monitoring on labtest2001." [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722) (owner: Andrew Bogott)
[05:00:52] Operations, Toolforge, Patch-For-Review, cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (bd808) I switched the Toolforge PHP 7.2 image: https://gerrit.wikimedia.org/r/#/c/operations/docker-images/toollabs-images/+/496102/
[05:16:58] (CR) BryanDavis: [C: +1] "tesseract-ocr-all is a *lot* more packages. My count at https://packages.debian.org/stretch-backports/tesseract-ocr-all is 161 versus the " [puppet] - https://gerrit.wikimedia.org/r/496008 (https://phabricator.wikimedia.org/T218151) (owner: Tpt)
[05:36:57] (CR) BryanDavis: [C: +1] "Seems fine to me. The bug referenced is a bit recursive. Maybe we should at least get someone from Security to post a note on the bug agre" [puppet] - https://gerrit.wikimedia.org/r/496063 (https://phabricator.wikimedia.org/T218165) (owner: MarcoAurelio)
[06:00:22] (PS1) Marostegui: db-eqiad.php: Depool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/496369
[06:01:52] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/496369 (owner: Marostegui)
[06:02:52] (Merged) jenkins-bot: db-eqiad.php: Depool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/496369 (owner: Marostegui)
[06:04:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1098 (duration: 00m 56s)
[06:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:05] !log Upgrade MySQL on db1098
[06:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:10] Operations, ops-eqiad, DBA: dbproxy1012 power supply without power - https://phabricator.wikimedia.org/T217394 (Marostegui) Thank you! ` properties CreationTimestamp = 20190307150222.000000-360 ElementName = System Event Log Entry RecordData = The input power for power supply 2 has been restor...
[06:08:36] (CR) jenkins-bot: db-eqiad.php: Depool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/496369 (owner: Marostegui)
[06:08:51] !log Finished manual run of unpublished ContentTranslation draft purge script (T217818)
[06:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:54] T217818: Run unpublished draft purge script for CX (Week of 03/10) - https://phabricator.wikimedia.org/T217818
[06:18:52] (PS1) Marostegui: db-eqiad.php: Slowly repool db1097:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/496370
[06:19:51] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1097:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/496370 (owner: Marostegui)
[06:20:47] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1097:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/496370 (owner: Marostegui)
[06:21:00] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1097:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/496370 (owner: Marostegui)
[06:21:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1098:3317 (duration: 00m 55s)
[06:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:27] Operations, MobileFrontend, TechCom-RFC, Traffic, Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (daniel) Accepted as an RFC
[06:25:51] (PS1) Marostegui: db-eqiad.php: Give more traffic to db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/496371
[06:32:06] (CR) Marostegui: [C: +2] db-eqiad.php: Give more traffic to db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/496371 (owner: Marostegui)
[06:32:55] (Merged) jenkins-bot: db-eqiad.php: Give more traffic to db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/496371 (owner: Marostegui)
[06:33:18] PROBLEM - puppet last run on wdqs1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cronUtils.sh]
[06:34:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1098:3317 (duration: 00m 55s)
[06:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:51] (PS1) Marostegui: db-eqiad.pph: More traffic to db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/496372
[06:39:28] (PS2) Marostegui: db-eqiad.php: More traffic to db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/496372
[06:40:07] !log Upgrade mysql on dbstore2002
[06:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:22] Operations, Analytics, EventBus, Prod-Kubernetes, and 2 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (akosiaris) Do we have logs of this happening?
[06:42:53] (CR) jenkins-bot: db-eqiad.php: Give more traffic to db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/496371 (owner: Marostegui)
[06:46:47] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging /home/akosiaris/deployment-charts/charts/cxserver/ [namespace: cxserver, clusters: staging]
[06:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:48] !log akosiaris@deploy1001 scap-helm cxserver cluster staging completed
[06:46:48] !log akosiaris@deploy1001 scap-helm cxserver finished
[06:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:49] (PS3) Alexandros Kosiaris: Fix typo, bump cxserver chart version [deployment-charts] - https://gerrit.wikimedia.org/r/496195
[06:48:17] (PS3) Marostegui: db-eqiad.php: More traffic to db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/496372
[06:49:41] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/496372 (owner: Marostegui)
[06:50:39] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/496372 (owner: Marostegui)
[06:50:54] !log marostegui@deploy1001 sync-file aborted: More traffic to db1097 (duration: 00m 00s)
[06:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1097 (duration: 00m 55s)
[06:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1098 (duration: 00m 54s)
[06:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:30] (PS1) Marostegui: db-eqiad.php: More traffic to db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/496374
[06:54:06] (CR) jenkins-bot: db-eqiad.php: More traffic to db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/496372 (owner: Marostegui)
[06:59:16] RECOVERY - puppet last run on wdqs1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:37] (PS4) Alexandros Kosiaris: Fix typo, bump cxserver chart version [deployment-charts] - https://gerrit.wikimedia.org/r/496195
[07:01:45] (CR) Alexandros Kosiaris: [C: -1] "This misses the new .tgz (yes we should automate that)" [deployment-charts] - https://gerrit.wikimedia.org/r/496271 (owner: Ottomata)
[07:01:52] (CR) Alexandros Kosiaris: [V: +2 C: +2] Fix typo, bump cxserver chart version [deployment-charts] - https://gerrit.wikimedia.org/r/496195 (owner: Alexandros Kosiaris)
[07:05:00] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/496374 (owner: Marostegui)
[07:05:58] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/496374 (owner: Marostegui)
[07:06:11] (CR) jenkins-bot: db-eqiad.php: More traffic to db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/496374 (owner: Marostegui)
[07:07:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1098 (duration: 00m 55s)
[07:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:41] (PS1) Marostegui: db-eqiad.php: Fully repool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/496375
[07:14:48] (CR) Alexandros Kosiaris: [C: -1] Service name and IPs for ldap-behind-lvs (3 comments) [dns] - https://gerrit.wikimedia.org/r/496007 (https://phabricator.wikimedia.org/T218133) (owner: Andrew Bogott)
[07:15:58] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging]
[07:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:59] !log akosiaris@deploy1001 scap-helm cxserver cluster staging completed
[07:15:59] !log akosiaris@deploy1001 scap-helm cxserver finished
[07:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:09] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad]
[07:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:10] !log akosiaris@deploy1001 scap-helm cxserver cluster eqiad completed
[07:16:11] !log akosiaris@deploy1001 scap-helm cxserver finished
[07:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:15] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw]
[07:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:17] !log akosiaris@deploy1001 scap-helm cxserver cluster codfw completed
[07:16:17] !log akosiaris@deploy1001 scap-helm cxserver finished
[07:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:50] !log kartik@deploy1001 Started deploy [cxserver/deploy@3ba57a5]: Update cxserver to b16f4a1 (T212577, T208386)
[07:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:55] T208386: MT error while translating the big reflist section - https://phabricator.wikimedia.org/T208386
[07:18:55] T212577: Simplify default MT config - https://phabricator.wikimedia.org/T212577
[07:19:06] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/496375 (owner: Marostegui)
[07:20:05] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/496375 (owner: Marostegui)
[07:21:13] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1098 (duration: 00m 55s)
[07:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:28] (PS1) Marostegui: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496376
[07:22:40] !log kartik@deploy1001 Finished deploy [cxserver/deploy@3ba57a5]: Update cxserver to b16f4a1 (T212577, T208386) (duration: 03m 50s)
[07:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:55] akosiaris: cxserver is all yours now :)
[07:24:02] kart_: thanks!
[07:24:14] I should start directing traffic to kubernetes pretty soon
[07:24:22] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496376 (owner: Marostegui)
[07:24:25] hopefully by Tuesday morning we will be done
[07:25:11] cool!
[07:25:21] (Merged) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496376 (owner: Marostegui)
[07:26:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1088 (duration: 00m 54s)
[07:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:29] Fatal errors.. :(
[07:27:44] How come I keep getting those sometimes with Special:Nuke on English?
[07:27:48] "PHP fatal error:
[07:27:48] entire web request took longer than 60 seconds and timed out "
[07:27:57] (CR) jenkins-bot: db-eqiad.php: Fully repool db1098 [mediawiki-config] - https://gerrit.wikimedia.org/r/496375 (owner: Marostegui)
[07:27:59] (CR) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496376 (owner: Marostegui)
[07:34:31] (Abandoned) TerraCodes: Enable local uploads for tcywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/390303 (https://phabricator.wikimedia.org/T166763) (owner: TerraCodes)
[07:38:12] Bsadowski1: that is a protection against trying to do very heavy operations so that the wikis don't get overloaded
[07:38:29] Ah okay
[07:38:33] Bsadowski1: if it happens more than once
[07:38:47] (it would be normal to happen 1 and then you can retry)
[07:39:03] I would file a ticket against Extension:Nuke
[07:39:22] (CR) Alexandros Kosiaris: [C: -2] "I just noticed the change to the liveness probe. Tests have concluded that using an HTTP get as a liveness will cause major disruptions wh" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/496271 (owner: Ottomata)
[07:39:46] Bsadowski1: it could be it is already filed, I see https://phabricator.wikimedia.org/T212690
[07:40:34] I think the solution to that problem would be https://phabricator.wikimedia.org/T188679
[07:42:36] !log Upgrade db1088
[07:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:47] (CR) ArielGlenn: [C: +2] Fix README typo, clarify variable naming [dumps/dcat] - https://gerrit.wikimedia.org/r/484011 (owner: Hoo man)
[07:48:43] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw]
[07:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:44] !log akosiaris@deploy1001 scap-helm cxserver cluster codfw completed
[07:48:44] !log akosiaris@deploy1001 scap-helm cxserver finished
[07:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:07] (PS1) Marostegui: db-eqiad.php: Slowly repool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496378
[07:51:52] (CR) Alexandros Kosiaris: [C: +2] "All traffic has been shifted, merging" [puppet] - https://gerrit.wikimedia.org/r/494213 (https://phabricator.wikimedia.org/T213194) (owner: Alexandros Kosiaris)
[07:51:53] Operations, SRE-Access-Requests: Requesting access to stat1007 for sukhe - https://phabricator.wikimedia.org/T217438 (Vgutierrez) a:Nuria→Vgutierrez
[07:52:00] (PS3) Alexandros Kosiaris: lvs: Use the kubernetes cluster for citoid [puppet] - https://gerrit.wikimedia.org/r/494213 (https://phabricator.wikimedia.org/T213194)
[07:52:03] (CR) Alexandros Kosiaris: [V: +2 C: +2] lvs: Use the kubernetes cluster for citoid [puppet] - https://gerrit.wikimedia.org/r/494213 (https://phabricator.wikimedia.org/T213194) (owner: Alexandros Kosiaris)
[07:52:07] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496378 (owner: Marostegui)
[07:53:16] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496378 (owner: Marostegui)
[07:54:03] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496378 (owner: Marostegui)
[07:54:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1088 (duration: 00m 55s)
[07:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:18] (PS1) Vgutierrez: admin: create user with analytics-privatedata access for sukhe [puppet] - https://gerrit.wikimedia.org/r/496379 (https://phabricator.wikimedia.org/T217438)
[08:02:27] (PS1) Marostegui: db-eqiad.php: More traffic to db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496380
[08:08:30] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496380 (owner: Marostegui)
[08:09:32] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496380 (owner: Marostegui)
[08:10:18] (PS3) Alexandros Kosiaris: citoid: Clean up old scb cluster stanzas [puppet] - https://gerrit.wikimedia.org/r/494214 (https://phabricator.wikimedia.org/T213194)
[08:10:24] (CR) Alexandros Kosiaris: [C: +2] citoid: Clean up old scb cluster stanzas [puppet] - https://gerrit.wikimedia.org/r/494214 (https://phabricator.wikimedia.org/T213194) (owner: Alexandros Kosiaris)
[08:10:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1088 (duration: 00m 55s)
[08:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:41] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Wikidata, and 9 others: Fix inefficient CacheAwarePropertyInfoStore memcached access pattern - https://phabricator.wikimedia.org/T97368 (elukey)
[08:16:42] (CR) jenkins-bot: db-eqiad.php: More traffic to db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496380 (owner: Marostegui)
[08:21:03] (PS1) Alexandros Kosiaris: Send traffic for cxserver to kubernetes hosts as well [puppet] - https://gerrit.wikimedia.org/r/496381 (https://phabricator.wikimedia.org/T213195)
[08:21:08] (PS1) Alexandros Kosiaris: lvs: Use the kubernetes cluster for cxserver [puppet] - https://gerrit.wikimedia.org/r/496382 (https://phabricator.wikimedia.org/T213195)
[08:21:10] (PS1) Alexandros Kosiaris: cxserver: Clean up old scb cluster stanzas [puppet] - https://gerrit.wikimedia.org/r/496383 (https://phabricator.wikimedia.org/T213195)
[08:21:46] !log Upgrade s3 codfw master (db2043) there will be lag on s3 codfw
[08:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:44] (PS1) Marostegui: db-eqiad.php: Fully repool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496384
[08:24:54] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496384 (owner: Marostegui)
[08:25:52] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496384 (owner: Marostegui)
[08:26:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1088 (duration: 00m 53s)
[08:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:44] (CR) jenkins-bot: db-eqiad.php: Fully repool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/496384 (owner: Marostegui)
[08:30:31] Operations, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), User-Elukey: Consider raising Memcached MWObject cache memory size limit - https://phabricator.wikimedia.org/T217731 (elukey) >>! In T217731#5021033, @aaron wrote: > Can the (extra) space be dedicated more so towards the...
[08:33:49] (CR) Alexandros Kosiaris: [C: +1] "+1 on premise, with the caveat antoine already mentioned" [puppet] - https://gerrit.wikimedia.org/r/480957 (owner: Hashar)
[08:37:29] Operations, serviceops, Performance-Team (Radar), User-Elukey, User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (elukey) >>! In T213089#4940510, @elukey wrote: > EDIT: after a chat with upstream it was suggested to me to follow up with Debi...
[08:39:11] (PS2) Dzahn: doc/dumps::web::htmldumps: remove diamond collector [puppet] - https://gerrit.wikimedia.org/r/496149
[08:42:49] (CR) Dzahn: [C: +2] doc/dumps::web::htmldumps: remove diamond collector [puppet] - https://gerrit.wikimedia.org/r/496149 (owner: Dzahn)
[08:44:04] Operations, serviceops, Performance-Team (Radar), User-Elukey, User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (elukey) Leaving here also a reference of https://github.com/memcached/memcached/issues/359: > Regression in systemd-based sandb...
[08:44:41] !log Deploy schema change on s4 codfw master (db2051), this will generate lag on codfw
[08:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:22] (PS1) Dzahn: loggin::webrequest::ops: remove diamond collector [puppet] - https://gerrit.wikimedia.org/r/496385
[08:49:05] (PS2) Dzahn: loggin::webrequest::ops: remove diamond collector [puppet] - https://gerrit.wikimedia.org/r/496385 (https://phabricator.wikimedia.org/T212231)
[08:50:12] Operations, monitoring, Patch-For-Review: Remove Diamond from production - https://phabricator.wikimedia.org/T212231 (Dzahn) remove from doc1001 and francium: https://gerrit.wikimedia.org/r/c/operations/puppet/+/496149
[08:50:38] (PS3) Dzahn: logging::webrequest::ops: remove diamond collector [puppet] - https://gerrit.wikimedia.org/r/496385 (https://phabricator.wikimedia.org/T212231)
[08:50:50] (CR) Dzahn: [C: +2] logging::webrequest::ops: remove diamond collector [puppet] - https://gerrit.wikimedia.org/r/496385 (https://phabricator.wikimedia.org/T212231) (owner: Dzahn)
[08:52:47] (CR) Dzahn: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/496133 (https://phabricator.wikimedia.org/T218185) (owner: GTirloni)
[08:53:02] (PS16) Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[08:54:00] (CR) jerkins-bot: [V: -1] cloudelastic: Add cloudelastic configs [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[08:56:49] ACKNOWLEDGEMENT - nova-compute proc maximum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute daniel_zahn stalled on dc work per Andrew https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:02:28] !log ms-be2037 - down since a couple hours, no SAL or ticket, powercycling
[09:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:30] RECOVERY - Host ms-be2037 is UP: PING WARNING - Packet loss = 61%, RTA = 0.69 ms
[09:06:23] (CR) Alexandros Kosiaris: [C: -1] "This will probably break all labs VMs" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722) (owner: Andrew Bogott)
[09:07:53] ACKNOWLEDGEMENT - NFS on labstore1006 is CRITICAL: connect to address 208.80.154.7 and port 2049: Connection refused daniel_zahn https://phabricator.wikimedia.org/T217474 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore
[09:07:54] PROBLEM - puppet last run on ms-be2037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:13:06] RECOVERY - puppet last run on ms-be2037 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[09:15:15] Operations, ops-eqiad, Analytics, Analytics-Kanban: confirm gpu form factor in stat1005 - https://phabricator.wikimedia.org/T216528 (elukey) @Cmjohnson do you have time today/tomorrow to answer Rob's question? It would unblock us to order the new GPU :) (sorry for the hassle with stat1005, we hop...
[09:41:17] Operations, serviceops, Performance-Team (Radar), User-Elukey, User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (MoritzMuehlenhoff) >>! In T213089#5023411, @elukey wrote: > Leaving here also a reference of https://github.com/memcached/memcac...
[09:46:28] (03PS7) 10Tim Eulitz: Set up exceptions for rollback confirmation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494270 (https://phabricator.wikimedia.org/T217436) [09:46:39] (03PS5) 10Tim Eulitz: Add default user config for rollback confirmation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495667 (https://phabricator.wikimedia.org/T217436) [09:54:50] !log ci: live hacked job https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/ in attempt to capture 'core' files from hhvm | https://gerrit.wikimedia.org/r/#/c/integration/config/+/496392/ | T216689 [09:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:53] T216689: Merge blocker: quibble-vendor-mysql-hhvm-docker in gate fails for most merges (exit status -11) - https://phabricator.wikimedia.org/T216689 [09:58:01] (03PS1) 10Dzahn: mariadb: add Icinga notes URLs to monitoring checks [puppet] - 10https://gerrit.wikimedia.org/r/496393 [10:00:28] (03PS16) 10Elukey: hadoop: allow the configuration of ssl-(server|client).xml configs [puppet] - 10https://gerrit.wikimedia.org/r/493693 (https://phabricator.wikimedia.org/T217412) [10:04:29] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15114/" [puppet] - 10https://gerrit.wikimedia.org/r/493693 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [10:14:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10aborrero) >>! In T215012#5021924, @Cmjohnson wrote: > @aborrero the CPU is here...let me know when it's safe for me... 
[10:17:29] (03PS1) 10Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496394 [10:18:54] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496394 (owner: 10Marostegui) [10:20:16] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496394 (owner: 10Marostegui) [10:21:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1081 (duration: 00m 58s) [10:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:23] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496394 (owner: 10Marostegui) [10:37:51] (03PS1) 10Muehlenhoff: Add library hint for libsdl1.2 [puppet] - 10https://gerrit.wikimedia.org/r/496398 [10:39:50] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libsdl1.2 [puppet] - 10https://gerrit.wikimedia.org/r/496398 (owner: 10Muehlenhoff) [10:40:32] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (10Qgil) >>! In T52864#5022889, @Tgr wrote: > There has been a lot of activity on Discourse, OTOH. @Qgil might be able to say more on that. A few week... 
[10:44:59] !log installing libsdl1.2 security updates for jessie [10:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:15] (03CR) 10Addshore: [C: 03+1] Increased maxSerializedEntitySize from 2500 to 3000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496161 (https://phabricator.wikimedia.org/T217739) (owner: 10Mahveotm) [10:50:02] !log cp2002: pool varnish-fe to resume ATS testing T213263 [10:50:02] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006 (10jbond) Thanks for the response In the last option the anycast prefix should get more then 50% of the traffic due to the SRTT algorithm mentioned by bblack bu... [10:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:04] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 [10:50:21] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2002.codfw.wmnet,service=nginx [10:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:22] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2002.codfw.wmnet,service=varnish-fe [10:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:21] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496400 [10:55:29] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496400 (owner: 10Marostegui) [10:55:54] (03PS1) 10Elukey: Add TLS configuration to the Hadoop testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/496401 (https://phabricator.wikimedia.org/T217412) [10:56:41] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496400 
(owner: 10Marostegui) [10:57:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1081 (duration: 00m 57s) [10:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:04] zeljkof: Should I +2 wmf.21 patch I'm planning to deploy first? [10:59:48] kart_: yes, that one will take 10 or so minutes to merge [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190314T1100). [11:00:04] kart_ and alaa_wmde: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:05] if you have a config patch, you can deploy it while the first one is being merged [11:00:18] bah [11:00:23] zeljkof: sure. I've both :) [11:00:29] european slots are tied to pacific time?:(! [11:00:42] hashar: yes, all deployments [11:01:06] hashar: I've updated deployments google calendar this morning, it was locked to CET, now it's locked to PT [11:01:07] (03PS2) 10Elukey: Add TLS configuration to the Hadoop testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/496401 (https://phabricator.wikimedia.org/T217412) [11:01:09] (03PS1) 10Elukey: profile::hadoop::common: fix a missing '.' [puppet] - 10https://gerrit.wikimedia.org/r/496402 (https://phabricator.wikimedia.org/T217412) [11:01:18] kart_: go ahead with your patches [11:01:25] alaa_wmde: around for SWAT? [11:01:28] zeljkof: sure. Going.. [11:02:12] (03CR) 10Elukey: [C: 03+2] profile::hadoop::common: fix a missing '.' [puppet] - 10https://gerrit.wikimedia.org/r/496402 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [11:02:45] (03CR) 10KartikMistry: [C: 03+2] "SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/493672 (https://phabricator.wikimedia.org/T216129) (owner: 10KartikMistry) [11:03:54] (03Merged) 10jenkins-bot: Enable ExternalGuidance to all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493672 (https://phabricator.wikimedia.org/T216129) (owner: 10KartikMistry) [11:04:59] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496400 (owner: 10Marostegui) [11:05:00] OK. Going with testing config patch first while wmf.21 patch is CI'ng. [11:05:01] (03CR) 10jenkins-bot: Enable ExternalGuidance to all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493672 (https://phabricator.wikimedia.org/T216129) (owner: 10KartikMistry) [11:05:14] addshore: alaa_wmde scheduled this for swat, but they are not around https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/493011 [11:05:35] I see you're a reviewer [11:06:16] (03CR) 10Zfilipin: "This is scheduled for EU SWAT today, but the developer is not in #wikimedia-operations, so it will not be deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493011 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [11:06:59] whois Amir1 [11:07:06] argh :) [11:07:33] Amir1: see ^ [11:08:49] jouncebot: now [11:08:49] For the next 0 hour(s) and 51 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190314T1100) [11:09:06] zeljkof: I’ll ping Alaa in a few minutes (currently in a daily) [11:09:09] .. testing on mwdebug now .. [11:09:22] Lucas_WMDE: thanks! 
[11:09:35] (03PS1) 10Elukey: Add fake secrets for Hadoop Analytics Test [labs/private] - 10https://gerrit.wikimedia.org/r/496406 [11:09:37] (03PS1) 10Jbond: Exclude /mnt/hdfs from lsof operations in wmf-auto-restarts [puppet] - 10https://gerrit.wikimedia.org/r/496405 (https://phabricator.wikimedia.org/T217646) [11:09:48] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake secrets for Hadoop Analytics Test [labs/private] - 10https://gerrit.wikimedia.org/r/496406 (owner: 10Elukey) [11:11:17] (03PS1) 10Marostegui: db-eqiad.php: Give db1081 API traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496407 [11:11:26] zeljkof: hey, I don't remember adding this to SWAT [11:11:39] I think someone else added it [11:11:59] hi I'm here for testing musical notation [11:12:12] Amir1: yes, alaa_wmde added it, I just wanted to make sure you're aware it would not get deployed unless somebody was around to test it [11:12:29] we got caught by the DST jump :) [11:12:30] oh don't worry. I'm around to help [11:12:30] alaa_wmde: great, please stand by, you're next, after kart_ is finished [11:12:59] alaa_wmde: you're not a deployer, right? [11:13:00] huh, what’s this “Gerrit hashtag” on the deployments calendar? [11:13:19] Lucas_WMDE: twentyafterfour is working on automating swat, that's the first step [11:13:24] neato [11:13:27] sounds exciting [11:13:32] it is :) [11:13:46] Amir1: want to deploy alaa_wmde's patch? or should I do it? [11:13:51] zeljkof: automating? wow. [11:13:59] I can [11:14:03] kart_: well, _increasing_ automation :) [11:14:04] I'm not a deployer yet, right [11:14:16] Amir1: great, you're next then, after kart_ [11:14:21] (03PS1) 10Elukey: Add fake hieradata secrets for Hadoop Analytics Test [labs/private] - 10https://gerrit.wikimedia.org/r/496408 [11:14:37] my job is done here, developers deploying their patches [11:14:38] zeljkof: Deploying config patch now.. 
[11:14:42] * zeljkof rides into sunset [11:14:46] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake hieradata secrets for Hadoop Analytics Test [labs/private] - 10https://gerrit.wikimedia.org/r/496408 (owner: 10Elukey) [11:14:51] zeljkof: lol. [11:15:10] * zeljkof raises hat [11:15:18] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:493672]] Enable ExternalGuidance to all Wikipedias (T216129) (duration: 00m 57s) [11:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:21] T216129: External Guidance: Deployment on all Wikipedias (but mainly visible for English translations) - https://phabricator.wikimedia.org/T216129 [11:15:42] OK. Here goes wmf.21 patch. [11:15:55] zeljkof: anything special care for wmf.21 patches? [11:16:18] kart_: no, as far as I know. what do you mean? :) [11:16:47] zeljkof: OK. I'll ask if I'm confused. [11:17:03] kart_: please do, it's all documented, but could be confusing [11:17:24] deploying core/extension is slightly different than config [11:17:49] see https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#mediawiki/extensions_and_mediawiki/skins [11:17:56] and https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#mediawiki/extensions_and_mediawiki/skins_2 [11:18:47] zeljkof: I did git fetch and it shows.. [11:19:08] modified: ../extensions/Echo (new commits) [11:19:20] that's not good :) [11:19:26] yeah [11:19:35] maybe somebody forgot to deploy something? [11:19:56] can you paste the output? 
https://phabricator.wikimedia.org/paste/ [11:20:19] sure [11:20:28] (03PS1) 10Marostegui: wmnet: Depool dbproxy1010 [dns] - 10https://gerrit.wikimedia.org/r/496410 [11:20:54] (03CR) 10Marostegui: "Aiming to do it Monday or Tuesday next week" [dns] - 10https://gerrit.wikimedia.org/r/496410 (owner: 10Marostegui) [11:20:59] zeljkof: https://phabricator.wikimedia.org/P8199 [11:21:35] kart_: well, I guess you can ignore it, I'll take a look [11:21:45] kart_: why are you doing a fetch in /skins? [11:22:03] zeljkof: because it is a skins patch? [11:22:08] ah :) [11:22:11] let me see the patch [11:22:25] ah, minerva [11:22:56] anyway, you should do a fetch in skins/MinervaNeue, not in skins/ [11:23:06] (03PS3) 10Elukey: Add TLS configuration to the Hadoop testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/496401 (https://phabricator.wikimedia.org/T217412) [11:23:31] kart_: `[extensions|skins]/[NAME]$ git status` [11:24:29] so: `cd /srv/mediawiki-staging/php-1.33.0-wmf.21/skins/MinervaNeue` [11:24:43] `git status; git fetch...` [11:25:19] OK. Cool. [11:26:22] (03CR) 10Elukey: [C: 03+2] "From https://puppet-compiler.wmflabs.org/compiler1002/15119/ it looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/496401 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [11:26:41] kart_: looks like this is the Echo patch that didn't get deployed https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/496080 [11:27:06] ah. [11:27:16] zeljkof: will it affect? [11:27:26] Krinkle, kostajh: looks like this is fetches at deploy1001? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/496080 [11:28:45] _fetched_ (but not rebased?) [11:29:26] (03PS2) 10Marostegui: wmnet: Depool dbproxy1010 [dns] - 10https://gerrit.wikimedia.org/r/496410 [11:29:55] zeljkof: ah. There is a git rebase step missing in the skins instructions? :) [11:30:09] kart_: no, the instructions are correct [11:30:33] I'm just looking at them, `submodule update` is how it's merged [11:30:45] OK.
[11:31:32] zeljkof: AFAIK, last 3 steps from extensions are also needed for skins? [11:31:53] kart_: yes, the steps are the same for extensions and skins [11:33:35] OK. Doing that. [11:34:40] (03PS1) 10Santhosh: Correct the enable context detection configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496412 [11:36:35] (03PS2) 10Santhosh: Correct the enable context detection configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496412 [11:39:59] OK. We also need above patch, but now deploying skin fix first. [11:41:23] zeljkof: Q: Do I need to sync file changed or entire extension? [11:41:59] kart_: I usually sync the entire skin/extension [11:42:09] OK [11:42:15] you can sync file/folder if you prefer, but I didn't notice any speed improvement [11:44:23] zeljkof: so for entire folder command is sync-folder? [11:44:36] kart_: no, the same command for all [11:44:43] OK [11:44:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/496405 (https://phabricator.wikimedia.org/T217646) (owner: 10Jbond) [11:45:00] kart_: see https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#mediawiki/extensions_and_mediawiki/skins_2 [11:45:01] (03CR) 10KartikMistry: [C: 03+2] Correct the enable context detection configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496412 (owner: 10Santhosh) [11:45:13] scap sync-file ... [11:45:20] Got it. Started. [11:45:44] !log kartik@deploy1001 Synchronized php-1.33.0-wmf.21/skins/MinervaNeue: SWAT: [[gerrit:496364|Ensure page-actions icons are `display:block` (T218182) (duration: 00m 57s) [11:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:47] T218182: Mobile site shows icons with wrong size when accessed in Google translate - https://phabricator.wikimedia.org/T218182 [11:46:14] zeljkof: I've another hot fix for config. 
[11:46:21] (03Merged) 10jenkins-bot: Correct the enable context detection configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496412 (owner: 10Santhosh) [11:46:38] kart_: ok [11:47:12] (03PS1) 10Jbond: Add logging to cumin nodes [puppet] - 10https://gerrit.wikimedia.org/r/496415 (https://phabricator.wikimedia.org/T116011) [11:48:59] (03PS2) 10Jbond: Exclude /mnt/hdfs from lsof operations in wmf-auto-restarts [puppet] - 10https://gerrit.wikimedia.org/r/496405 (https://phabricator.wikimedia.org/T217646) [11:50:41] zeljkof: Last question. For more than one file we do scap sync-file file1 file2 file3 ? [11:50:49] (03CR) 10jenkins-bot: Correct the enable context detection configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496412 (owner: 10Santhosh) [11:50:55] kart_: no! [11:50:57] :) [11:51:10] you can sync a file or a folder [11:51:14] which patch is it? [11:51:16] OK. [11:51:47] (03CR) 10Jbond: [C: 03+2] Exclude /mnt/hdfs from lsof operations in wmf-auto-restarts [puppet] - 10https://gerrit.wikimedia.org/r/496405 (https://phabricator.wikimedia.org/T217646) (owner: 10Jbond) [11:51:47] zeljkof: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/496412/ [11:52:10] zeljkof: OK. It seems we need to revert it. It seems not needed. [11:52:15] kart_: sync wmf-config folder [11:52:20] zeljkof: can you do that quickly? [11:52:28] kart_: do what? [11:52:44] revert. I merged and rebased that patch on deploy1001 [11:53:57] kart_: which patch? [11:54:04] why don't you do it? [11:54:11] zeljkof: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/496412/ [11:54:22] zeljkof: No effect of it, basically. [11:54:30] Or let me deploy it :) [11:55:16] kart_: I'm confused? why can't you revert and deploy? [11:55:32] ok. Let's go simple way. [11:55:38] Deploy and fix later. [11:55:43] alaa_wmde, Amir1: looks like we ran out of time today :( is your patch urgent? 
[11:56:01] not urgent no [11:56:11] we haven't announced yet to the community either [11:56:11] oops! [11:56:13] it's fine for me [11:56:52] alaa_wmde, Amir1: please schedule it for another swat then, and thanks :) [11:57:10] yup will do right away [11:57:18] great! [11:57:21] thanks @zeljkof [11:57:34] alaa_wmde: I'm sorry. Took more time than estimated. [11:57:48] zeljkof: scap'ng last patch. [11:57:55] kart_ is new to deployments, still practicing, it happens :) [11:58:03] !log kartik@deploy1001 scap failed: average error rate on 3/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [11:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:14] zeljkof: now this is new. [11:58:35] @kart_ no worries at all! good luck with it [11:58:40] kart_: uh oh, so something _did_ go wrong [11:59:11] `Notice: Undefined variable: wmgExternalGuidanceEnableContextDetection in /srv/mediawiki/wmf-config/CommonSettings.php on line 3162` [11:59:15] zeljkof: so, we need to revert this. [11:59:30] zeljkof: let me do that, if needed I'll ask for help. [11:59:33] kart_: since swat is almost done, you should just revert everything [11:59:36] (03CR) 10Effie Mouzeli: [C: 03+2] Send traffic for cxserver to kubernetes hosts as well [puppet] - 10https://gerrit.wikimedia.org/r/496381 (https://phabricator.wikimedia.org/T213195) (owner: 10Alexandros Kosiaris) [11:59:38] kart_: ok, I'm around [11:59:43] OK [12:00:02] (03PS2) 10Effie Mouzeli: Send traffic for cxserver to kubernetes hosts as well [puppet] - 10https://gerrit.wikimedia.org/r/496381 (https://phabricator.wikimedia.org/T213195) (owner: 10Alexandros Kosiaris) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190314T1200) [12:01:42] zeljkof: I can push a revert patch via Gerrit too instead of command line, right?
[12:02:05] !log kartik@deploy1001 Synchronized wmf-config: SWAT: Revert [[gerrit:496412]] Fix content detection config (duration: 00m 56s) [12:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:55] kart_: I'm not sure what you ask [12:03:06] you can revert in terminal, then push to gerrit [12:03:15] or you can revert in gerrit, then pull from terminal [12:03:38] (03PS1) 10Alexandros Kosiaris: k8s: Remove old unused $master_ip var [puppet] - 10https://gerrit.wikimedia.org/r/496416 [12:04:23] zeljkof: It always fails for me via https from commandline. [12:04:30] So, I'll just use Gerrit web [12:04:51] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s: Remove old unused $master_ip var [puppet] - 10https://gerrit.wikimedia.org/r/496416 (owner: 10Alexandros Kosiaris) [12:05:10] (03PS1) 10KartikMistry: Revert "Correct the enable context detection configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496418 [12:05:25] zeljkof: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/496418/ [12:05:45] (03PS20) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [12:05:45] kart_: how does it fail? can you paste the output? 
[12:06:05] !log T216497 drop some packages from jessie-wikimedia/openstack-mitaka-jessie: libvirt*, librados2, librbd1, because they induce the resolver to conflict with those included in stretch [12:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:08] T216497: CloudVPS: workaround archival of jessie-backports repo - https://phabricator.wikimedia.org/T216497 [12:06:26] zeljkof: this time, authentication failure :/ [12:06:36] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: drop more packages from openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/496420 (https://phabricator.wikimedia.org/T216497) [12:06:41] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [12:06:57] (03CR) 10Jcrespo: "This version doesn't give the exception, but it hits an exclusive process lock." [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [12:07:22] zeljkof: will send it. I should +2 config file and git fetch to deploy1001, right? [12:08:11] (03CR) 10KartikMistry: [C: 03+2] Revert "Correct the enable context detection configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496418 (owner: 10KartikMistry) [12:08:46] kart_: yes, the usual config deploy [12:08:53] cool.
[12:09:09] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: drop more packages from openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/496420 (https://phabricator.wikimedia.org/T216497) [12:09:20] (03Merged) 10jenkins-bot: Revert "Correct the enable context detection configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496418 (owner: 10KartikMistry) [12:12:24] !log T216497 drop some packages from jessie-wikimedia/openstack-mitaka-jessie: qemu-XXX [12:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:27] T216497: CloudVPS: workaround archival of jessie-backports repo - https://phabricator.wikimedia.org/T216497 [12:13:06] !log kartik@deploy1001 Synchronized wmf-config: SWAT: [[gerrit:496418]] Revert "Correct the enable context detection configuration" (duration: 00m 56s) [12:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:16] zeljkof: done. [12:13:40] so, I've done more reverts than deploy? :) [12:13:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: drop more packages from openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/496420 (https://phabricator.wikimedia.org/T216497) (owner: 10Arturo Borrero Gonzalez) [12:14:25] (03CR) 10jenkins-bot: Revert "Correct the enable context detection configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496418 (owner: 10KartikMistry) [12:14:30] kart_: that's how it goes sometimes :) [12:14:37] kart_: so all done with swat?
[12:14:44] yes zeljkof [12:14:52] !log EU SWAT finished [12:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:35] !log Send ~4% of cxserver traffic to eqiad k8s - T213195 [12:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:38] T213195: Migrate cxserver to kubernetes - https://phabricator.wikimedia.org/T213195 [12:21:42] !log jiji@cumin1001 conftool action : set/weight=1; selector: dc=eqiad,service=cxserver,cluster=scb,name=kubernetes.* [12:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:13] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=cxserver,cluster=scb,name=kubernetes.* [12:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:46] (03CR) 10Volans: [C: 03+1] "Change looks sane, thanks for the fixes." [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494956 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [12:22:50] (03PS1) 10Elukey: hadoop::ssl_config: separate ssl client/server configs [puppet/cdh] - 10https://gerrit.wikimedia.org/r/496426 (https://phabricator.wikimedia.org/T217412) [12:31:01] (03PS2) 10Elukey: hadoop::ssl_config: separate ssl client/server configs [puppet/cdh] - 10https://gerrit.wikimedia.org/r/496426 (https://phabricator.wikimedia.org/T217412) [12:33:13] (03CR) 10Elukey: [C: 03+2] hadoop::ssl_config: separate ssl client/server configs [puppet/cdh] - 10https://gerrit.wikimedia.org/r/496426 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [12:33:42] (03PS21) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [12:34:27] (03CR) 10Jcrespo: "Reverting, and now using processes instead, which seems to work (I was able to backup and finish test backup while x1 is still ongoing)." 
[puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [12:34:32] (03PS1) 10Elukey: Update cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/496428 (https://phabricator.wikimedia.org/T217412) [12:34:46] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [12:38:13] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15123/" [puppet] - 10https://gerrit.wikimedia.org/r/496428 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [12:42:46] !log Ramp up k8s cxserver traffic to 8% - T213195 [12:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:49] T213195: Migrate cxserver to kubernetes - https://phabricator.wikimedia.org/T213195 [12:49:49] !log jiji@cumin1001 conftool action : set/weight=1; selector: dc=codfw,service=cxserver,cluster=scb,name=kubernetes.* [12:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:02] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=cxserver,cluster=scb,name=kubernetes.* [12:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:47] (03CR) 10Volans: "The puppet side looks sane, a couple of comments inline. I skipped for now the python side of it." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [12:57:33] (03CR) 10Jcrespo: "All great suggestions, thanks!" 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [12:59:07] (03PS1) 10Elukey: profile::hadoop::firewall::master: break down ssl config [puppet] - 10https://gerrit.wikimedia.org/r/496430 (https://phabricator.wikimedia.org/T217412) [13:00:04] zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190314T1300). [13:03:12] o/ [13:03:18] * zeljkof train-ing [13:03:30] 10Operations, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), 10Core Platform Team Kanban (Doing), and 3 others: Enabling api-request eventgate to group1 caused minor service disruptions - https://phabricator.wikimedia.org/T218255 (10jijiki) [13:04:44] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15124/" [puppet] - 10https://gerrit.wikimedia.org/r/496430 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [13:07:18] (03PS1) 10Zfilipin: all wikis to 1.33.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496432 [13:07:20] (03CR) 10Zfilipin: [C: 03+2] all wikis to 1.33.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496432 (owner: 10Zfilipin) [13:08:26] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496432 (owner: 10Zfilipin) [13:10:01] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.21 [13:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:37] !log T216497 drop python-mysqldb from jessie-wikimedia/openstack-mitaka-jessie [13:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:40] T216497: CloudVPS: workaround archival of jessie-backports repo - https://phabricator.wikimedia.org/T216497 [13:10:56] PROBLEM - puppet 
last run on an-coord1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:11:36] (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496432 (owner: 10Zfilipin) [13:11:47] this was me killing puppet agent while manually running --^ [13:11:48] fixing now [13:12:52] (03PS2) 10Marostegui: db-eqiad.php: Give db1081 API traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496407 [13:13:55] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Give db1081 API traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496407 (owner: 10Marostegui) [13:15:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1081 into API (duration: 00m 49s) [13:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:38] !log T216497 drop libpulse0 from jessie-wikimedia/openstack-mitaka-jessie [13:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:42] T216497: CloudVPS: workaround archival of jessie-backports repo - https://phabricator.wikimedia.org/T216497 [13:16:08] RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:20:29] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: drop even more packages from jessie-wikimedia/openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/496433 (https://phabricator.wikimedia.org/T216497) [13:21:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: drop even more packages from jessie-wikimedia/openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/496433 (https://phabricator.wikimedia.org/T216497) (owner: 10Arturo Borrero Gonzalez) [13:22:54] (03CR) 10jenkins-bot: db-eqiad.php: Give db1081 API traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496407 (owner: 10Marostegui) [13:28:30] (03PS4) 10Dzahn: lvs/icinga/services: add notes_urls for Icinga 
[puppet] - 10https://gerrit.wikimedia.org/r/495009 [13:28:38] (03PS18) 10Alexandros Kosiaris: Initial configuration for session storage service [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [13:28:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] Initial configuration for session storage service [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [13:30:03] (03Abandoned) 10Dzahn: lvs/icinga/services: add notes_urls for Icinga [puppet] - 10https://gerrit.wikimedia.org/r/495009 (owner: 10Dzahn) [13:30:46] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1081 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496434 [13:34:33] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1081 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496434 (owner: 10Marostegui) [13:35:39] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1081 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496434 (owner: 10Marostegui) [13:36:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1081 into API (duration: 00m 48s) [13:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:05] 10Operations, 10Analytics, 10EventBus, 10Prod-Kubernetes, and 2 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (10Ottomata) https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.03.13/eventgate?id=AWl4sxguNBo9dX1kfcii&_... 
[13:39:55] (03PS1) 10Dzahn: service: add Icinga notes URL in defined types [puppet] - 10https://gerrit.wikimedia.org/r/496435 [13:41:16] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1081 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496434 (owner: 10Marostegui) [13:41:41] (03CR) 10Dzahn: "let's also add "diamond::remove: true" to the yaml files per https://debmonitor.wikimedia.org/packages/diamond / https://gerrit.wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [13:42:41] (03CR) 10Dzahn: "nevermind, you did in the common file :)" [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [13:43:07] 10Operations, 10serviceops, 10Core Platform Team Backlog (Later), 10Patch-For-Review, 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Ottomata) [13:44:09] (03PS2) 10Dzahn: service: add Icinga notes URL in defined types [puppet] - 10https://gerrit.wikimedia.org/r/496435 (https://phabricator.wikimedia.org/T197873) [13:44:16] 10Operations, 10monitoring, 10Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) After focused code reading and head scratching it turns out the root cause is that persisted metrics weren't sorted during migration, thi... 
[13:44:21] (03PS1) 10Arturo Borrero Gonzalez: Revert "openstack: nova: mitaka: stretch: install python-dogpile.core from jessie" [puppet] - 10https://gerrit.wikimedia.org/r/496436 (https://phabricator.wikimedia.org/T216497) [13:45:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "This has been reverted: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/496436/" [puppet] - 10https://gerrit.wikimedia.org/r/482013 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [13:45:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "openstack: nova: mitaka: stretch: install python-dogpile.core from jessie" [puppet] - 10https://gerrit.wikimedia.org/r/496436 (https://phabricator.wikimedia.org/T216497) (owner: 10Arturo Borrero Gonzalez) [13:46:23] 10Operations, 10Cassandra, 10Core Platform Team Kanban (Blocked Externally), 10Services (blocked), 10User-Eevans: puppetize turning off reserved space for cassandra /srv - https://phabricator.wikimedia.org/T132632 (10Eevans) >>! In T132632#5023075, @mobrovac wrote: > @Eevans @fgiunchedi should we go ahea... [13:46:56] (03PS1) 10Dzahn: graphite: add Icinga notes_url for graphite_freshness check [puppet] - 10https://gerrit.wikimedia.org/r/496437 [13:50:06] !log reimaging cloudvirt1015 [13:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:18] PROBLEM - Check correctness of the icinga configuration on icinga2001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [13:50:55] (03PS1) 10Dzahn: monitoring: link to dcops runbook page for mgmt interface checks [puppet] - 10https://gerrit.wikimedia.org/r/496438 [13:51:02] oh.. i will look at the icinga config alert [13:51:45] (03CR) 10MarcoAurelio: "> Seems fine to me. The bug referenced is a bit recursive. 
Maybe we" [puppet] - 10https://gerrit.wikimedia.org/r/496063 (https://phabricator.wikimedia.org/T218165) (owner: 10MarcoAurelio) [13:51:53] Could not find any hostgroup matching 'sessions_codfw' [13:54:19] !log take a snapshot of data on prometheus2004 [13:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:38] (03CR) 10Dzahn: "Icinga has a config issue alert .. not finding the hostgroup "sessions_codfw"" [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [13:55:13] (03CR) 10Ottomata: "Interesting. So yesterday there were pods that couldn't be reached by mediawiki:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/496271 (owner: 10Ottomata) [13:56:35] (03CR) 10Marostegui: [C: 03+1] "This looks good: https://puppet-compiler.wmflabs.org/compiler1002/15127/" [puppet] - 10https://gerrit.wikimedia.org/r/496393 (owner: 10Dzahn) [14:01:23] * apergos peeks in [14:01:56] (03PS1) 10Paladox: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/496441 [14:04:16] (03CR) 10Eevans: [C: 03+1] "> Icinga has a config issue alert .. 
not finding the hostgroup" [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [14:08:30] (03PS1) 10Dzahn: Icinga: add host groups for session storage service [puppet] - 10https://gerrit.wikimedia.org/r/496445 (https://phabricator.wikimedia.org/T215883) [14:09:03] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [14:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:07] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/496445" [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [14:09:08] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [14:09:09] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:14] (03PS2) 10Dzahn: Icinga: add host groups for session storage service [puppet] - 10https://gerrit.wikimedia.org/r/496445 (https://phabricator.wikimedia.org/T215883) [14:10:54] (03PS7) 10Gehel: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [14:11:45] (03CR) 10Andrew Bogott: "Argh, you're right. 
I guess I'll just leave those bits in for now until I feel like refactoring all the ldap users :(" [puppet] - 10https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722) (owner: 10Andrew Bogott) [14:12:27] (03PS18) 10Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - 10https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722) [14:13:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] Icinga: add host groups for session storage service [puppet] - 10https://gerrit.wikimedia.org/r/496445 (https://phabricator.wikimedia.org/T215883) (owner: 10Dzahn) [14:13:08] (03PS1) 10Ppchelko: [EventBus] Decrease timeout and use hasty mode for analytics. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496446 (https://phabricator.wikimedia.org/T218260) [14:14:27] (03CR) 10Paladox: [C: 03+2] Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/496441 (owner: 10Paladox) [14:15:06] (03PS19) 10Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - 10https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722) [14:15:52] (03Abandoned) 10Ottomata: Use httpGet for liveness_probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/496271 (owner: 10Ottomata) [14:18:32] (03CR) 10Vgutierrez: "> Patch Set 7: Code-Review+1" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494956 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [14:19:21] (03PS1) 10Gehel: elasticsearch: create default elasticsearch config for all versions [puppet] - 10https://gerrit.wikimedia.org/r/496447 [14:21:07] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-eqiad-values.yaml --set main_app.version=v1.0.3-wmf0 stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [14:21:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:08] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [14:21:08] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:30] (03CR) 10DCausse: [C: 03+1] elasticsearch: create default elasticsearch config for all versions [puppet] - 10https://gerrit.wikimedia.org/r/496447 (owner: 10Gehel) [14:22:59] (03CR) 10Gehel: [C: 03+2] elasticsearch: create default elasticsearch config for all versions [puppet] - 10https://gerrit.wikimedia.org/r/496447 (owner: 10Gehel) [14:23:38] (03PS2) 10Dzahn: mariadb: add Icinga notes URLs to monitoring checks [puppet] - 10https://gerrit.wikimedia.org/r/496393 [14:24:44] PROBLEM - Check systemd state on sessionstore1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:25:25] (03CR) 10Dzahn: "Notice: /Stage[main]/Profile::Icinga/Monitoring::Group[sessions_codfw]/Nagios_hostgroup[sessions_codfw]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/496445 (https://phabricator.wikimedia.org/T215883) (owner: 10Dzahn) [14:27:52] (03PS8) 10Gehel: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [14:28:42] PROBLEM - cassandra-a service on sessionstore1003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [14:29:48] (03PS8) 10Andrew Bogott: Service name and IPs for ldap-behind-lvs [dns] - 10https://gerrit.wikimedia.org/r/496007 (https://phabricator.wikimedia.org/T218133) [14:30:40] PROBLEM - cassandra-a CQL 10.192.32.15:9042 on sessionstore2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [14:30:58] RECOVERY - Check correctness of the icinga configuration on icinga2001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [14:31:40] (03CR) 10Dzahn: "10:30 <+icinga-wm> RECOVERY - Check correctness of the icinga configuration on icinga2001 is OK: Icinga configuration is correct https://w" [puppet] - 10https://gerrit.wikimedia.org/r/496445 (https://phabricator.wikimedia.org/T215883) (owner: 10Dzahn) [14:32:43] (03PS9) 10Gehel: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [14:33:16] (03CR) 10Ottomata: "I think this is the right thing to do (maybe timeout of 2 or 3 just in case?) but I'd like to see if we can figure out what's wrong with n" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496446 (https://phabricator.wikimedia.org/T218260) (owner: 10Ppchelko) [14:33:56] PROBLEM - cassandra-a CQL 10.64.32.78:9042 on sessionstore1002 is CRITICAL: connect to address 10.64.32.78 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:34:10] ^ these are known.. 
new service being setup [14:34:23] sessionstore* that is [14:35:44] PROBLEM - cassandra-a SSL 10.64.32.78:7001 on sessionstore1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [14:35:48] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Cmjohnson) Should've checked this but thumbor1004 is out of warranty. [14:36:50] (03CR) 10Dzahn: [C: 03+2] mariadb: add Icinga notes URLs to monitoring checks [puppet] - 10https://gerrit.wikimedia.org/r/496393 (owner: 10Dzahn) [14:37:08] (03PS10) 10Gehel: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [14:37:12] PROBLEM - Check systemd state on sessionstore1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:37:36] PROBLEM - cassandra-a service on sessionstore1002 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [14:38:41] mutante: Let me know when I can test it by stopping replication on a codfw host, so it will alert (only on IRC) [14:39:00] PROBLEM - Check systemd state on sessionstore1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:39:05] marostegui: ok [14:39:09] ACKNOWLEDGEMENT - Check systemd state on sessionstore1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T215883 [14:39:09] ACKNOWLEDGEMENT - Check systemd state on sessionstore1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn https://phabricator.wikimedia.org/T215883 [14:39:09] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.32.78:9042 on sessionstore1002 is CRITICAL: connect to address 10.64.32.78 and port 9042: Connection refused daniel_zahn https://phabricator.wikimedia.org/T215883 https://phabricator.wikimedia.org/T93886 [14:39:09] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.32.78:7001 on sessionstore1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused daniel_zahn https://phabricator.wikimedia.org/T215883 https://phabricator.wikimedia.org/T120662 [14:39:09] ACKNOWLEDGEMENT - cassandra-a service on sessionstore1002 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed daniel_zahn https://phabricator.wikimedia.org/T215883 [14:39:09] ACKNOWLEDGEMENT - Check systemd state on sessionstore1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T215883 [14:39:09] ACKNOWLEDGEMENT - cassandra-a service on sessionstore1003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed daniel_zahn https://phabricator.wikimedia.org/T215883 [14:39:10] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.16.79:9042 on sessionstore2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T215883 https://phabricator.wikimedia.org/T93886 [14:39:10] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.32.15:9042 on sessionstore2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T215883 https://phabricator.wikimedia.org/T93886 [14:43:04] marostegui: it should work now [14:43:21] great thanks! 
[14:43:43] !log Stop replication on db2070 to test the url_notes (will alert only on IRC) [14:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:49] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10Cmjohnson) The error didn't appear again (yet) but I created a task with Dell worst case they push back...best they send a DIMM. We're less than 30 days from end of warranty. [14:47:00] 10Operations, 10ops-eqiad: mw1264 DIMM error - https://phabricator.wikimedia.org/T217274 (10Cmjohnson) The error didn't appear again (yet) but I created a task with Dell worst case they push back...best they send a DIMM. We're less than 30 days from end of warranty. [14:47:03] (03CR) 10Jbond: [C: 03+2] Add logging to cumin nodes [puppet] - 10https://gerrit.wikimedia.org/r/496415 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [14:47:11] (03PS1) 10Ottomata: eventgate-analytics - Fix liveness and readiness probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/496454 (https://phabricator.wikimedia.org/T218255) [14:47:19] (03PS2) 10Jbond: Add logging to cumin nodes [puppet] - 10https://gerrit.wikimedia.org/r/496415 (https://phabricator.wikimedia.org/T116011) [14:50:56] 10Operations, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): labstore1006 spontaneous reboot - https://phabricator.wikimedia.org/T217473 (10Cmjohnson) I updated all the F/W on this server. I am removing the dc ops tag. If this becomes a h/w issue please add back. 
[14:53:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] eventgate-analytics - Fix liveness and readiness probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/496454 (https://phabricator.wikimedia.org/T218255) (owner: 10Ottomata) [14:53:34] (03PS2) 10Mholloway: WikimediaEditorTasks: Add config to InitializeSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496211 (https://phabricator.wikimedia.org/T218137) [14:53:37] !log analytics-tool1003 - stopping idle screen session [14:53:37] (03PS2) 10Mholloway: WikimediaEditorTasks: Add Beta Cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496212 (https://phabricator.wikimedia.org/T218137) [14:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:39] (03PS2) 10Mholloway: WikimediaEditorTasks: Load extension in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496213 (https://phabricator.wikimedia.org/T218137) [14:56:58] PROBLEM - MariaDB Slave Lag: s1 on db2070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 790.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:57:07] mutante: ^ \o/ [14:57:10] marostegui: URL :) [14:57:29] !log Start replication on db2070 after testing url_notes [14:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:48] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:58:38] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:59:24] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics - Fix liveness and readiness probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/496454 
(https://phabricator.wikimedia.org/T218255) (owner: 10Ottomata) [14:59:30] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - Fix liveness and readiness probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/496454 (https://phabricator.wikimedia.org/T218255) (owner: 10Ottomata) [14:59:33] (03PS11) 10Gehel: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [15:00:18] RECOVERY - MariaDB Slave Lag: s1 on db2070 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:02:28] (03CR) 10Dbarratt: [C: 03+1] Enforce 8 char password length requirements for non-privileged users on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496260 (owner: 10Dmaza) [15:02:42] 10Operations, 10Cassandra, 10Core Platform Team Kanban (Done with CPT), 10Patch-For-Review, and 2 others: Setup automated topk wide row reporting - https://phabricator.wikimedia.org/T147366 (10CCicalese_WMF) [15:03:04] 10Operations, 10Core Platform Team Kanban (Done with CPT), 10Services (blocked): Set warning thresholds for average cluster utilization - https://phabricator.wikimedia.org/T76306 (10CCicalese_WMF) [15:04:32] (03PS12) 10Gehel: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [15:07:02] (03CR) 10Mobrovac: [C: 04-1] [EventBus] Decrease timeout and use hasty mode for analytics. (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496446 (https://phabricator.wikimedia.org/T218260) (owner: 10Ppchelko) [15:10:59] 10Operations, 10netops: eqiad - eqord Telia link down - IC-314533 - https://phabricator.wikimedia.org/T218307 (10Dzahn) [15:11:57] (03CR) 10Mathew.onipe: [C: 03+1] "Let's run on it on relforge to get more feedback. I'm sure something is waiting to break with this." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [15:12:50] 10Operations, 10netops: eqiad - eqord Telia link down - IC-314533 - https://phabricator.wikimedia.org/T218307 (10Dzahn) affected circuit: https://netbox.wikimedia.org/circuits/circuits/31/ [15:15:05] (03CR) 10Muehlenhoff: [C: 03+1] Add logging to cumin nodes [puppet] - 10https://gerrit.wikimedia.org/r/496415 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [15:15:14] 10Operations, 10Gerrit, 10Phabricator: No longer possible to make CORS requests from Phabricator to Gerrit - https://phabricator.wikimedia.org/T218308 (10Jdlrobson) [15:18:07] (03PS2) 10Ppchelko: [EventBus] Decrease timeout and use hasty mode for analytics. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496446 (https://phabricator.wikimedia.org/T218260) [15:18:29] (03CR) 10Volans: "Hard to tell if all will work without a test instance, but I've left some comment inline" (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [15:23:46] (03CR) 10Volans: [C: 03+1] "LGTM, although I might miss some low level context here." 
(035 comments) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494957 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [15:25:13] (03CR) 10Muehlenhoff: Add cookbook for elastic6 upgrade (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [15:25:42] (03PS13) 10Gehel: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [15:26:31] (03CR) 10Gehel: Add cookbook for elastic6 upgrade (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [15:30:30] (03PS14) 10Gehel: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [15:31:02] (03CR) 10Gehel: Add cookbook for elastic6 upgrade (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [15:34:55] (03PS1) 10Alexandros Kosiaris: Add sessionstore[12]00[123]-a.site.wmnet RRs [dns] - 10https://gerrit.wikimedia.org/r/496462 (https://phabricator.wikimedia.org/T215883) [15:35:57] (03CR) 10Mholloway: "Should I take the database/cluster stanzas out for now (to be defined later when enabling in production), just to unblock getting this ont" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496211 (https://phabricator.wikimedia.org/T218137) (owner: 10Mholloway) [15:36:02] 10Operations, 10Gerrit, 10Phabricator, 10Security-Team, 10Traffic: No longer possible to make CORS requests from Phabricator to Gerrit - https://phabricator.wikimedia.org/T218308 (10chasemp) [15:39:04] (03CR) 10Gergő Tisza: [C: 03+1] "No, having them there as a placeholder is fine." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/496211 (https://phabricator.wikimedia.org/T218137) (owner: 10Mholloway) [15:40:58] (03PS3) 10Mholloway: WikimediaEditorTasks: Add config to InitializeSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496211 (https://phabricator.wikimedia.org/T218137) [15:41:00] (03PS3) 10Mholloway: WikimediaEditorTasks: Add Beta Cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496212 (https://phabricator.wikimedia.org/T218137) [15:41:02] (03PS3) 10Mholloway: WikimediaEditorTasks: Load extension in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496213 (https://phabricator.wikimedia.org/T218137) [15:41:03] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga2001 is CRITICAL: 141 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [15:42:11] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (10Cmjohnson) a new ticket has been created with Dell [15:42:14] (03CR) 10Mholloway: "Ack, will put them back..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/496211 (https://phabricator.wikimedia.org/T218137) (owner: 10Mholloway) [15:42:39] that’s likely due to the event yesterday, having a look [15:43:07] (03CR) 10Muehlenhoff: Add cookbook for elastic6 upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [15:43:38] (03PS4) 10Mholloway: WikimediaEditorTasks: Add config to InitializeSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496211 (https://phabricator.wikimedia.org/T218137) [15:43:40] (03PS4) 10Mholloway: WikimediaEditorTasks: Add Beta Cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496212 (https://phabricator.wikimedia.org/T218137) [15:43:42] (03PS4) 10Mholloway: WikimediaEditorTasks: Load extension in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496213 (https://phabricator.wikimedia.org/T218137) [15:45:33] PROBLEM - Host sessionstore1003 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:33] PROBLEM - Host sessionstore1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:03] PROBLEM - Host sessionstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:14] 10Operations, 10netops: eqiad - eqord Telia link down - IC-314533 - https://phabricator.wikimedia.org/T218307 (10Dzahn) I sent an email to Telia with the circuit ID and time. They responded saying "Its up now ? There is no scheduled maintenance from our side. " and i said No, it's still down, our monitoring sh... 
[15:46:16] ignore these ^ [15:46:17] what's up with that [15:46:20] service is not ready [15:46:21] ah [15:46:23] rog [15:46:57] (03PS3) 10Jbond: Add logging to cumin nodes [puppet] - 10https://gerrit.wikimedia.org/r/496415 (https://phabricator.wikimedia.org/T116011) [15:47:29] seeing an increase in mw info logging since 13:09 [15:49:14] from channel “deprecated” [15:49:26] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (10jijiki) We are still having errors, I am depooling. @Papaul ` [Thu Mar 14 11:56:00 2019] perf: interrupt took too long (4960 > 4946), lowering kernel.perf_event_max_samp... [15:49:27] RECOVERY - Host sessionstore1002 is UP: PING OK - Packet loss = 0%, RTA = 36.46 ms [15:49:33] RECOVERY - Host sessionstore1001 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms [15:49:37] RECOVERY - Host sessionstore1003 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [15:49:54] Use of ParserOutput::getModuleScripts was deprecated in MediaWiki 1.33. [Called from ApiParsoidBatch::preprocess in /srv/mediawiki/php-1.33.0-wmf.21/extensions/ParsoidBatchAPI/includes/ApiParsoidBatch.php at line 229] [15:50:07] RECOVERY - cassandra-a service on sessionstore1003 is OK: OK - cassandra-a is active [15:50:48] 09:10:01 <+logmsgbot> !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.21 [15:50:54] looks like it correlates with that release herron [15:51:10] (apologies for my timestamp in local time) [15:51:49] yeah indeed it does [15:52:33] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10jijiki) We are still having errors ` [Thu Mar 14 14:42:19 2019] mce: [Hardware Error]: Machine check events logged [Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: HAN... 
[15:52:37] PROBLEM - Check whether ferm is active by checking the default input chain on sessionstore1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:52:39] PROBLEM - Check whether ferm is active by checking the default input chain on sessionstore1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:52:43] PROBLEM - Check whether ferm is active by checking the default input chain on sessionstore1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:52:44] again, ignore [15:52:54] (03CR) 10CRusnov: "Plumbing should be good now." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/496229 (owner: 10CRusnov) [15:53:31] (03PS1) 10Alexandros Kosiaris: sessionstore: Switch to using the -a addresses [puppet] - 10https://gerrit.wikimedia.org/r/496472 (https://phabricator.wikimedia.org/T215883) [15:53:38] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: common: add proper ordering for nova policy [puppet] - 10https://gerrit.wikimedia.org/r/496474 [15:53:43] PROBLEM - cassandra-a service on sessionstore1003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:54:42] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add sessionstore[12]00[123]-a.site.wmnet RRs [dns] - 10https://gerrit.wikimedia.org/r/496462 (https://phabricator.wikimedia.org/T215883) (owner: 10Alexandros Kosiaris) [15:57:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: nova: common: add proper ordering for nova policy [puppet] - 10https://gerrit.wikimedia.org/r/496474 (owner: 10Arturo Borrero Gonzalez) [15:57:23] (03PS4) 10Jbond: Add logging to cumin nodes [puppet] - 10https://gerrit.wikimedia.org/r/496415 (https://phabricator.wikimedia.org/T116011) [15:58:33] (03PS2) 10Alexandros Kosiaris: sessionstore: Switch to using the -a addresses [puppet] - 10https://gerrit.wikimedia.org/r/496472 
(https://phabricator.wikimedia.org/T215883) [15:58:36] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] sessionstore: Switch to using the -a addresses [puppet] - 10https://gerrit.wikimedia.org/r/496472 (https://phabricator.wikimedia.org/T215883) (owner: 10Alexandros Kosiaris) [15:59:06] (03CR) 10Mobrovac: [C: 03+1] [EventBus] Decrease timeout and use hasty mode for analytics. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496446 (https://phabricator.wikimedia.org/T218260) (owner: 10Ppchelko) [15:59:20] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T218307 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:59:20] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T218307 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:59:38] (03PS22) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [15:59:40] (03PS1) 10Jcrespo: mariadb-backups-monitoring: Link to more specific subpage [puppet] - 10https://gerrit.wikimedia.org/r/496475 (https://phabricator.wikimedia.org/T205626) [16:00:04] godog and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190314T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:00:41] (03PS2) 10Jcrespo: mariadb-backups-monitoring: Link to more specific subpage on icinga [puppet] - 10https://gerrit.wikimedia.org/r/496475 (https://phabricator.wikimedia.org/T205626) [16:00:54] (03PS3) 10Jcrespo: mariadb-backups-monitoring: Link to more specific subpage on icinga [puppet] - 10https://gerrit.wikimedia.org/r/496475 (https://phabricator.wikimedia.org/T205626) [16:01:35] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [16:01:37] 10Operations, 10Discovery-Search, 10Elasticsearch: cleanup the custom elasticsearch_${version}@ systemd unit in favor of an override configuration - https://phabricator.wikimedia.org/T218315 (10Gehel) [16:02:13] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: drop python-dogpile.cache from openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/496476 (https://phabricator.wikimedia.org/T216497) [16:02:24] 10Operations, 10Discovery-Search, 10Elasticsearch: cleanup the custom elasticsearch_${version}@ systemd unit in favor of an override configuration - https://phabricator.wikimedia.org/T218315 (10Gehel) p:05Triage→03High [16:02:29] (03PS1) 10Ottomata: eventgaate-analytics - Enable rdkafka statsd metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/496477 (https://phabricator.wikimedia.org/T218305) [16:02:35] RECOVERY - cassandra-a service on sessionstore1002 is OK: OK - cassandra-a is active [16:02:41] !log T216497 drop python-dogpile.cache from jessie-wikimedia/openstack-mitaka-jessie [16:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:44] T216497: CloudVPS: workaround archival of jessie-backports repo - https://phabricator.wikimedia.org/T216497 [16:02:47] (03CR) 10Volans: [C: 03+1] "Ack, LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/496229 (owner: 10CRusnov) [16:03:04] 
(03CR) 10jerkins-bot: [V: 04-1] aptrepo: drop python-dogpile.cache from openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/496476 (https://phabricator.wikimedia.org/T216497) (owner: 10Arturo Borrero Gonzalez) [16:03:29] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Update to upstream v2.5.8 tag. [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/496229 (owner: 10CRusnov) [16:04:17] !log reboot one final time all sessionstore[12]00[123] servers [16:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:23] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: drop python-dogpile.cache from openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/496476 (https://phabricator.wikimedia.org/T216497) [16:05:01] PROBLEM - Host sessionstore2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: drop python-dogpile.cache from openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/496476 (https://phabricator.wikimedia.org/T216497) (owner: 10Arturo Borrero Gonzalez) [16:05:27] RECOVERY - Host sessionstore2002 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [16:06:04] (03PS5) 10Jbond: Add logging to cumin nodes [puppet] - 10https://gerrit.wikimedia.org/r/496415 (https://phabricator.wikimedia.org/T116011) [16:06:11] (03CR) 10DCausse: [C: 03+1] Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [16:06:44] jouncebot: now [16:06:44] For the next 0 hour(s) and 53 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190314T1600) [16:07:20] 10Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (10hashar) Eventually we had HHVM segfault that started to be very problematic since last week at least (T216689). A stacktrace points at pthreads_create and the [[ https://metadata.ftp-master.debian.org/changelogs//main... 
[16:07:33] (03PS4) 10Jcrespo: mariadb-backups-monitoring: Link to more specific subpage on icinga [puppet] - 10https://gerrit.wikimedia.org/r/496475 (https://phabricator.wikimedia.org/T205626) [16:08:01] !log reimaging cloudvirt1015 again [16:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:03] PROBLEM - cassandra-a SSL 10.192.32.15:7001 on sessionstore2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [16:08:13] PROBLEM - cassandra-a service on sessionstore1003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [16:08:15] ACKNOWLEDGEMENT - Logstash rate of ingestion percent change compared to yesterday on icinga2001 is CRITICAL: 141 ge 130 Herron Discussing in T206675 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [16:08:19] PROBLEM - Check whether ferm is active by checking the default input chain on sessionstore1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:08:19] PROBLEM - Check whether ferm is active by checking the default input chain on sessionstore1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:08:23] PROBLEM - Check whether ferm is active by checking the default input chain on sessionstore1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:08:25] PROBLEM - Check whether ferm is active by checking the default input chain on sessionstore2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:08:25] PROBLEM - Check whether ferm is active by checking the default input chain on sessionstore2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:08:41] PROBLEM - Check systemd state on sessionstore2002 
is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:08:57] PROBLEM - Check systemd state on sessionstore2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:09:01] DNS query for 'sessionstore1002-a.eqiad.wmnet' failed: NXDOMAIN [16:09:07] PROBLEM - Check whether ferm is active by checking the default input chain on sessionstore2003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:09:18] sigh this is probably the resolvers still caching the negative answers [16:09:19] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloudvirt1015: add overrides for interface names [puppet] - 10https://gerrit.wikimedia.org/r/496478 [16:09:33] PROBLEM - Check systemd state on sessionstore2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:09:35] (03PS15) 10Gehel: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [16:09:46] (03PS6) 10Jbond: Add logging to cumin nodes [puppet] - 10https://gerrit.wikimedia.org/r/496415 (https://phabricator.wikimedia.org/T116011) [16:10:26] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups-monitoring: Link to more specific subpage on icinga [puppet] - 10https://gerrit.wikimedia.org/r/496475 (https://phabricator.wikimedia.org/T205626) (owner: 10Jcrespo) [16:11:09] PROBLEM - cassandra-a service on sessionstore1002 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [16:11:18] (03PS7) 10Jbond: Add logging to cumin nodes [puppet] - 10https://gerrit.wikimedia.org/r/496415 (https://phabricator.wikimedia.org/T116011) [16:11:19] (03CR) 10Jbond: "https://puppet-compiler.wmflabs.org/compiler1002/15130/cumin1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/496415 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [16:11:22] (03CR) 10Dzahn: [C: 03+2] mariadb-backups-monitoring: Link to more
specific subpage on icinga [puppet] - 10https://gerrit.wikimedia.org/r/496475 (https://phabricator.wikimedia.org/T205626) (owner: 10Jcrespo) [16:11:29] (03PS2) 10Arturo Borrero Gonzalez: hiera: cloudvirt1015: add overrides for interface names [puppet] - 10https://gerrit.wikimedia.org/r/496478 [16:12:21] RECOVERY - cassandra-a service on sessionstore1002 is OK: OK - cassandra-a is active [16:12:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: cloudvirt1015: add overrides for interface names [puppet] - 10https://gerrit.wikimedia.org/r/496478 (owner: 10Arturo Borrero Gonzalez) [16:14:45] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgaate-analytics - Enable rdkafka statsd metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/496477 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [16:14:47] (03PS20) 10Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - 10https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722) [16:15:20] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10ayounsi) My theory so far, until we can get confirmation from JTAC (as I can't find any doc confirming it or not), is that the firewall action `next-ip` can only be applie... [16:16:01] Is puppet swat occurring? :) [16:16:02] (03PS16) 10Gehel: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [16:17:06] (03PS1) 10Ottomata: Add eventgate-analytics-0.0.10.tgz [deployment-charts] - 10https://gerrit.wikimedia.org/r/496482 [16:17:08] (03CR) 10Volans: [C: 03+1] "Looks ok to me. There are potential things that could be added, might be out of scope. Small nitpick in the docstring." 
(035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [16:17:24] after scrolling up I see no :) *slaps self for not scrolling before* [16:18:10] I guess the only blocker on netbox sync is the upgrade, I think i've got the plumbing correct now? [16:18:12] going to deploy a netbox upgrade in a mo if that's okay with everyone [16:18:39] (03CR) 10Mathew.onipe: [C: 03+1] logstash: tune shard size check to reflect the current known good sizes [puppet] - 10https://gerrit.wikimedia.org/r/496184 (owner: 10Gehel) [16:18:58] (03CR) 10Jbond: [C: 03+2] Add logging to cumin nodes [puppet] - 10https://gerrit.wikimedia.org/r/496415 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [16:19:10] (03PS8) 10Jbond: Add logging to cumin nodes [puppet] - 10https://gerrit.wikimedia.org/r/496415 (https://phabricator.wikimedia.org/T116011) [16:19:39] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add eventgate-analytics-0.0.10.tgz [deployment-charts] - 10https://gerrit.wikimedia.org/r/496482 (owner: 10Ottomata) [16:20:30] PROBLEM - cassandra-a CQL 10.192.48.132:9042 on sessionstore2003 is CRITICAL: connect to address 10.192.48.132 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:22:16] PROBLEM - cassandra-a SSL 10.192.48.132:7001 on sessionstore2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [16:22:34] (03CR) 10Herron: [C: 03+1] logstash: tune shard size check to reflect the current known good sizes [puppet] - 10https://gerrit.wikimedia.org/r/496184 (owner: 10Gehel) [16:22:48] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [16:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:50] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:22:51] !log 
otto@deploy1001 scap-helm eventgate-analytics finished [16:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:35] (03CR) 10Gehel: Add cookbook for elastic6 upgrade (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [16:23:53] (03PS17) 10Gehel: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [16:24:00] PROBLEM - cassandra-a CQL 10.192.16.95:9042 on sessionstore2001 is CRITICAL: connect to address 10.192.16.95 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:24:26] (03PS3) 10Gehel: logstash: tune shard size check to reflect the current known good sizes [puppet] - 10https://gerrit.wikimedia.org/r/496184 [16:25:28] (03CR) 10Gehel: [C: 03+2] logstash: tune shard size check to reflect the current known good sizes [puppet] - 10https://gerrit.wikimedia.org/r/496184 (owner: 10Gehel) [16:25:48] PROBLEM - cassandra-a SSL 10.192.16.95:7001 on sessionstore2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [16:28:05] (03CR) 10Volans: Add cookbook for elastic6 upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [16:28:14] ^ scheduling more downtimes. 
sessionstore* [16:28:26] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:28:55] (03CR) 10Andrew Bogott: [C: 03+2] openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - 10https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722) (owner: 10Andrew Bogott) [16:29:05] (03PS18) 10Gehel: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [16:29:14] (03PS21) 10Andrew Bogott: openldap: role/profile refactor for 'labs' and 'labtest' roles [puppet] - 10https://gerrit.wikimedia.org/r/496335 (https://phabricator.wikimedia.org/T46722) [16:29:17] (03CR) 10Gehel: Add cookbook for elastic6 upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [16:29:28] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:29:56] (03CR) 10Mathew.onipe: Add cookbook for elastic6 upgrade (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [16:30:16] RECOVERY - cassandra-a SSL 10.192.16.95:7001 on sessionstore2001 is OK: SSL OK - Certificate sessionstore2001-a valid until 2021-03-13 12:44:12 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [16:30:34] RECOVERY - cassandra-a CQL 10.192.16.95:9042 on sessionstore2001 is OK: TCP OK - 0.000 second response time on 10.192.16.95 port 9042 https://phabricator.wikimedia.org/T93886 [16:30:53] !log addshore@deploy1001 Synchronized php-1.33.0-wmf.21/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: [[gerrit:496481]] TermSqlIndex, track calls to getTermsOfEntities (duration: 00m 50s) [16:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:29] (03PS1) 10GTirloni: fullstackd: Use new Debian 9.8 image [puppet] - 
10https://gerrit.wikimedia.org/r/496485 (https://phabricator.wikimedia.org/T218314) [16:32:11] !log add default deny to mr1-* junos-host policies - T218234 [16:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:34] (03PS2) 10GTirloni: fullstackd: Use new Debian 9.8 image [puppet] - 10https://gerrit.wikimedia.org/r/496485 (https://phabricator.wikimedia.org/T218314) [16:33:17] (03CR) 10GTirloni: [C: 03+2] fullstackd: Use new Debian 9.8 image [puppet] - 10https://gerrit.wikimedia.org/r/496485 (https://phabricator.wikimedia.org/T218314) (owner: 10GTirloni) [16:34:22] RECOVERY - cassandra-a service on sessionstore1003 is OK: OK - cassandra-a is active [16:35:01] (03PS1) 10Elukey: profile::hadoop::common: add ssl parameter to ssl-config.xml's set [puppet] - 10https://gerrit.wikimedia.org/r/496486 (https://phabricator.wikimedia.org/T217412) [16:35:03] (03CR) 10Bstorm: "At least the last time I did something with the client packages on stretch, it was a tad exciting because of the hiera-defined version inf" [puppet] - 10https://gerrit.wikimedia.org/r/495757 (https://phabricator.wikimedia.org/T218009) (owner: 10Alex Monk) [16:35:04] RECOVERY - cassandra-a SSL 10.192.48.132:7001 on sessionstore2003 is OK: SSL OK - Certificate sessionstore2003-a valid until 2021-03-13 12:44:14 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [16:35:55] (03CR) 10jerkins-bot: [V: 04-1] profile::hadoop::common: add ssl parameter to ssl-config.xml's set [puppet] - 10https://gerrit.wikimedia.org/r/496486 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [16:36:04] RECOVERY - cassandra-a CQL 10.192.48.132:9042 on sessionstore2003 is OK: TCP OK - 0.000 second response time on 10.192.48.132 port 9042 https://phabricator.wikimedia.org/T93886 [16:36:06] (03PS1) 10Andrew Bogott: Remove ::profile::prometheus::openldap_exporter from labtest [puppet] - 10https://gerrit.wikimedia.org/r/496487 [16:37:01] (03CR) 10Andrew Bogott: [C: 03+2] 
Remove ::profile::prometheus::openldap_exporter from labtest [puppet] - 10https://gerrit.wikimedia.org/r/496487 (owner: 10Andrew Bogott) [16:37:20] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:42] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:37:48] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:38:10] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:38:41] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:38:54] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:39:14] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:39:27] (03CR) 10Cwhite: [C: 03+1] logstash: add udp json logback localhost compatability endpoint [puppet] - 10https://gerrit.wikimedia.org/r/496022 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [16:39:52] * arturo looking at that weird labnet1001 page [16:39:53] ACKNOWLEDGEMENT - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack andrew bogott Andrew will investigate https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:40:05] (03PS19) 10Gehel: Add cookbook for elastic6 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/493436 (owner: 10DCausse) [16:40:27] arturo: I just published a new image and updated fullstackd, maybe it was caught in between updating it [16:40:37] arturo: it's weird it's coming from labnet1001 though [16:40:53] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:41:02] yes [16:41:35] maybe puppet ran and killed something that icinga didn't like (fullstackd is a daemon) [16:42:08] This is me, investigating, PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:42:15] it's just IPv6 [16:42:46] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:42:46] 10Operations, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), and 4 others: Create puppet role for session storage service - https://phabricator.wikimedia.org/T215883 (10akosiaris) sessionstore hosts setup ` akosiaris@sessionstore1001:~$ nodetoo... 
[16:45:17] mr1 oob v6 should come back [16:45:18] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496022 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [16:45:52] !log crusnov@deploy1001 Started deploy [netbox/deploy@59430dd]: Deploy Ganeti Sync and Upgrade to upstream v2.5.8 - T215229 [16:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:56] T215229: Keep Ganeti VMs synchronized in Netbox - https://phabricator.wikimedia.org/T215229 [16:46:23] !log crusnov@deploy1001 Finished deploy [netbox/deploy@59430dd]: Deploy Ganeti Sync and Upgrade to upstream v2.5.8 - T215229 (duration: 00m 30s) [16:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:30] RECOVERY - Long running screen/tmux on analytics-tool1003 is OK: OK: No SCREEN or tmux processes detected. [16:47:58] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 44.57 ms [16:48:36] 10Operations, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): labstore1006 spontaneous reboot - https://phabricator.wikimedia.org/T217473 (10Bstorm) Thanks! 
[16:49:39] !log crusnov@deploy1001 Started deploy [netbox/deploy@59430dd]: Deploy Ganeti Sync and Upgrade to upstream v2.5.8 (netmon1002) - T215229 [16:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:29] !log crusnov@deploy1001 Finished deploy [netbox/deploy@59430dd]: Deploy Ganeti Sync and Upgrade to upstream v2.5.8 (netmon1002) - T215229 (duration: 00m 50s) [16:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:26] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [16:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:29] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:51:29] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:18] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10eross) Hi Dave! Which list does this pertain to? Arbcom-l ? [16:57:00] (03PS2) 10GTirloni: diamond: Do not collect disk usage for devicemapper devices [puppet] - 10https://gerrit.wikimedia.org/r/496133 (https://phabricator.wikimedia.org/T218185) [16:57:04] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:58:06] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:58:45] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10WormTT) All of them technically, but Arbcom-l is the one I'm concerned about. There was far les... [16:59:23] (03CR) 10GTirloni: [C: 03+2] diamond: Do not collect disk usage for devicemapper devices [puppet] - 10https://gerrit.wikimedia.org/r/496133 (https://phabricator.wikimedia.org/T218185) (owner: 10GTirloni) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190314T1700). [17:02:05] (03PS1) 10Dzahn: rm mediawiki::generic_monitoring (Apple bridge) [puppet] - 10https://gerrit.wikimedia.org/r/496489 [17:05:54] (03PS2) 10Herron: logstash: send mediawiki syslogs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/495962 (https://phabricator.wikimedia.org/T213899) [17:07:15] 10Operations, 10netops: Management routers: filter traffic from external to junos-host - https://phabricator.wikimedia.org/T218234 (10ayounsi) 05Open→03Resolved All patched. No need for this task to be private anymore. [17:07:20] 10Operations, 10netops: Management routers: filter traffic from external to junos-host - https://phabricator.wikimedia.org/T218234 (10ayounsi) [17:08:39] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10RobH) a:05Cmjohnson→03RobH So this has a memory error and is out of warranty. This means we should look at decommissioning this host and ordering a replacement.... 
[17:08:43] 10Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (10akosiaris) Base images for jessie and stretch have been built and pushed to the docker registry. [17:12:05] 10Operations, 10Gerrit, 10Phabricator, 10Security-Team, 10Traffic: No longer possible to make CORS requests from Phabricator to Gerrit - https://phabricator.wikimedia.org/T218308 (10Jdlrobson) Did something change regarding https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#site.allo... [17:12:31] !log Depool mw2206 - T215415 [17:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:34] T215415: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 [17:12:52] 10Operations, 10Gerrit, 10Phabricator, 10Security-Team, 10Traffic: No longer possible to make CORS requests from Phabricator to Gerrit - https://phabricator.wikimedia.org/T218308 (10Jdlrobson) (and to be clear I'm only interested in read only requests here) [17:15:00] (03PS2) 10Elukey: profile::hadoop::common: add ssl parameter to ssl-config.xml's set [puppet] - 10https://gerrit.wikimedia.org/r/496486 (https://phabricator.wikimedia.org/T217412) [17:15:14] !log Pool mw1280 back - T218006 [17:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:17] T218006: mw1280 crashed - https://phabricator.wikimedia.org/T218006 [17:16:30] (03PS1) 10Elukey: Add hiera overrides to analytics1037 to avoid using a broken disk [puppet] - 10https://gerrit.wikimedia.org/r/496490 [17:17:18] !log arlolra@deploy1001 Started deploy [parsoid/deploy@8cf4107]: Updating Parsoid to f3e2209 [17:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:22] 10Operations, 10Gerrit, 10Phabricator, 10Security-Team, 10Traffic: No longer possible to make CORS requests from Phabricator to Gerrit - https://phabricator.wikimedia.org/T218308 (10Dzahn) I don't see allowOriginRegex in our Gerrit 
config at all. That should mean "By default, unset, denying all cross-ori... [17:18:33] (03PS1) 10CRusnov: Minor bugfixes to netbox sync. [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/496491 [17:20:51] (03PS2) 10Elukey: Add hiera overrides to analytics1037 to avoid using a broken disk [puppet] - 10https://gerrit.wikimedia.org/r/496490 [17:21:04] 10Operations, 10Gerrit, 10Phabricator, 10Security-Team, 10Traffic: No longer possible to make CORS requests from Phabricator to Gerrit - https://phabricator.wikimedia.org/T218308 (10Dzahn) It should be the CSP on the Phabricator side. [17:21:09] 10Operations, 10Thumbor, 10hardware-requests: reallocate former image scaler to thumbor use - https://phabricator.wikimedia.org/T218323 (10RobH) p:05Triage→03Normal [17:21:18] (03PS3) 10Elukey: Add hiera overrides to analytics1037 to avoid using a broken disk [puppet] - 10https://gerrit.wikimedia.org/r/496490 [17:22:15] 10Operations, 10Thumbor, 10hardware-requests: reallocate former image scaler to thumbor use - https://phabricator.wikimedia.org/T218323 (10RobH) [17:23:15] godog: can I query prometheus metric names with wildcard or regexes? [17:23:19] or does that only work for labels? [17:24:27] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@8cf4107]: Updating Parsoid to f3e2209 (duration: 07m 09s) [17:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:14] nm go.dog i think i figured it out {__name__=~...} [17:25:19] ottomata: you can yeah, use __name__ as a label [17:25:22] yeah that's right [17:25:35] (03CR) 10Elukey: [C: 03+2] Add hiera overrides to analytics1037 to avoid using a broken disk [puppet] - 10https://gerrit.wikimedia.org/r/496490 (owner: 10Elukey) [17:26:01] ottomata: FWIW what are you trying to match? [17:26:16] eventgate_analytics_rdkafka_producer_guaranteed_eventgate_analytics_brokers_kafka_jumbo1006_eqiad_wmnet_9092_1006_tx [17:26:17] (03PS1) 10CRusnov: Expose CA cert to ganeti sync.
[puppet] - 10https://gerrit.wikimedia.org/r/496495 [17:26:28] goinig to template out the pieces [17:26:29] (03PS1) 10Jbond: Add prometheus interface to spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 [17:28:03] (03PS2) 10Jbond: Add prometheus interface to spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 [17:28:07] oh man, hostnames in the metric names? [17:29:09] ottomata: is that from statsd_exporter or prometheus native metrics out of curiosity? [17:29:19] win 35 [17:29:26] oops, i did it again [17:30:15] statsd_exporter [17:30:39] Could someone give this a +2? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/496260 [17:30:49] godog: its statsd_exporter using node-rdkafka-statsd [17:30:58] which just flattens the metrics stats object that librdkafka returns [17:31:27] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10RobH) Please note we may steal mw2245 for thumbor1005 use on T218323 [17:31:35] (03PS2) 10Elukey: role::aqs: use nodejs-10 for the aqs service [puppet] - 10https://gerrit.wikimedia.org/r/496110 (https://phabricator.wikimedia.org/T210706) [17:31:35] ottomata: ack, we should be writing a mapping for statsd_exporter to use if not already, to get sensible metrics [17:31:45] happy to help with that of course [17:32:15] (03CR) 10jerkins-bot: [V: 04-1] Add prometheus interface to spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 (owner: 10Jbond) [17:32:17] (03CR) 10Volans: [C: 04-1] "missing param, looks good otherwise" (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/496491 (owner: 10CRusnov) [17:32:56] !log Updated Parsoid to f3e2209 (T213950) [17:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:59] T213950: Links: External links with special characters, and surrounded by square brakets, are not rendered properly - 
https://phabricator.wikimedia.org/T213950 [17:33:19] (03CR) 10Elukey: [C: 03+2] role::aqs: use nodejs-10 for the aqs service [puppet] - 10https://gerrit.wikimedia.org/r/496110 (https://phabricator.wikimedia.org/T210706) (owner: 10Elukey) [17:33:34] (03CR) 10Volans: [C: 04-1] Minor bugfixes to netbox sync. (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/496491 (owner: 10CRusnov) [17:34:34] (03CR) 10Volans: [C: 03+1] "LGTM, nitpick inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496495 (owner: 10CRusnov) [17:34:57] Pchelolo: mobrovac ^^^ (see godog) [17:35:06] would it be worth diving into that now rather than later? [17:36:58] ottomata: not sure.. for CP we have enabled librdkafka metrics at some point but I've personally found them entirely useless and eventually removed them [17:37:06] (03PS2) 10CRusnov: Minor bugfixes to netbox sync. [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/496491 [17:37:24] Pchelolo: in this case, i want to know if the timeouts are caused by local kafka queues filling up [17:37:31] producer queues [17:37:44] i want a dash like this [17:37:44] https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1 [17:38:03] ah yes that's the other option of course, not having the metrics at all, however if they are there to stay I'd recommend having sensibly-named metrics sooner rather than later [17:38:09] (03CR) 10CRusnov: "Thanks :D" (032 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/496491 (owner: 10CRusnov) [17:38:19] (03PS1) 10Ema: varnish: set /w/load.php Age to 0 [puppet] - 10https://gerrit.wikimedia.org/r/496497 (https://phabricator.wikimedia.org/T105657) [17:39:47] i guess, godog Pchelolo , i'm about to embark on a complicated dashboard [17:39:58] if we are going to change the metrics, i'd rather do that first [17:39:59] (03PS4) 10Herron: logstash: add udp json logback localhost compatibility endpoint [puppet] - 10https://gerrit.wikimedia.org/r/496022
(https://phabricator.wikimedia.org/T213899) [17:40:19] (03PS2) 10CRusnov: Expose CA cert to netbox ganeti sync. [puppet] - 10https://gerrit.wikimedia.org/r/496495 [17:40:45] hmm, maybe statsd_exporter is the way to go here [17:40:46] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team, 10Traffic, 10Patch-For-Review: Expires header for load.php should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657 (10Krinkle) a:03Krinkle [17:40:47] akosiaris: thoughts? [17:41:04] we could make some configs for statsd_expoter to know how to map incoming stats metrics into labels? [17:41:42] changing the service-runner metrics stuff does sound a little difficult [17:42:10] or maybe there can be a totally separate prometheus metrics interface for service-runner? [17:42:21] (03PS5) 10Herron: logstash: add udp json logback localhost compatibility endpoint [puppet] - 10https://gerrit.wikimedia.org/r/496022 (https://phabricator.wikimedia.org/T213899) [17:42:24] i guess i could build that into eventgate itself rather than expect service-runner to do it [17:42:24] hm. [17:42:56] (03PS3) 10CRusnov: Expose CA cert to netbox ganeti sync. [puppet] - 10https://gerrit.wikimedia.org/r/496495 [17:43:55] !log Deploying AQS using scap (node10 upgrade) [17:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:31] ottomata: service-runner base metrics should have statsd_mappings already (e.g. 
for gc times) if that's useful [17:44:40] ya but only for the built in ones [17:45:08] (03CR) 10Herron: logstash: add udp json logback localhost compatibility endpoint (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496022 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [17:45:20] HMmmM, i could make a node-rdkafka-prometheus lib [17:45:47] !log mforns@deploy1001 Started deploy [analytics/aqs/deploy@13203f1]: Deploying AQS for node10 upgrade [17:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:02] simliar to the one that emits rdkafka stats for statsd instead [17:46:19] (03CR) 10CRusnov: [C: 03+2] Expose CA cert to netbox ganeti sync. [puppet] - 10https://gerrit.wikimedia.org/r/496495 (owner: 10CRusnov) [17:50:07] (03PS13) 10Gehel: elasticsearch: mjolnir bulk update lag [puppet] - 10https://gerrit.wikimedia.org/r/495693 (https://phabricator.wikimedia.org/T214494) (owner: 10Mathew.onipe) [17:51:00] OK for me to deploy a quick fix to the ParsoidBatchAPI extension? Debug noise fix per herron. [17:51:09] jouncebot: next [17:51:09] In 0 hour(s) and 8 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190314T1800) [17:51:15] (03CR) 10Gehel: [C: 03+2] elasticsearch: mjolnir bulk update lag [puppet] - 10https://gerrit.wikimedia.org/r/495693 (https://phabricator.wikimedia.org/T214494) (owner: 10Mathew.onipe) [17:51:18] I'll be quick. [17:54:56] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10Papaul) [17:57:30] Could someone give this a +2? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/496260 [17:58:11] PROBLEM - Mjolnir bulk update failure check - eqiad on icinga2001 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen [17:58:11] PROBLEM - Mjolnir bulk update failure check - codfw on icinga2001 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen [17:58:25] Oops [17:58:27] ^ new check, unexpected failure, checking [17:59:27] davidwbarratt: I can put it into the SWAT window in 30 seconds' time? [17:59:34] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.21/extensions/ParsoidBatchAPI/includes/ApiParsoidBatch.php: Hot-deploy I2842dfea to reduce deprecation spam after T206675 deploy of wmf.21 (duration: 00m 49s) [17:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:37] T206675: 1.33.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T206675 [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Morning SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190314T1800). [18:00:04] alaa_wmde and tim_WMDE: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:09] I can SWAT. [18:00:13] James_F sure. :) it's just a beta change. :) [18:00:14] Thanks James [18:00:22] davidwbarratt: I'll do yours now. [18:00:28] tim_WMDE: Alaa asked me if you could go first with your patches. 
[18:00:40] Sure [18:00:53] (03CR) 10Jforrester: [C: 03+2] Enforce 8 char password length requirements for non-privileged users on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496260 (owner: 10Dmaza) [18:01:05] Thiemo_WMDE ? [18:01:48] tim_WMDE: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/494270 first? [18:02:05] (03PS8) 10Jforrester: Set up exceptions for rollback confirmation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494270 (https://phabricator.wikimedia.org/T217436) (owner: 10Tim Eulitz) [18:02:05] Do we have two Alaa here? ;-) I meant alaa_wmde. [18:02:16] (03PS9) 10Jforrester: Set up exceptions for rollback confirmation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494270 (https://phabricator.wikimedia.org/T217436) (owner: 10Tim Eulitz) [18:02:37] James_F correct [18:02:46] Oh Thiemo_WMDE there's another Alaa :P [18:02:57] (03PS6) 10Jforrester: Add default user config for rollback confirmation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495667 (https://phabricator.wikimedia.org/T217436) (owner: 10Tim Eulitz) [18:03:06] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10Papaul) @RobH Any reason why we have to put those servers in the spares tracking sheet if the warranty expired date is Feb. 26, 2018 ? 
[18:03:07] (03Merged) 10jenkins-bot: Enforce 8 char password length requirements for non-privileged users on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496260 (owner: 10Dmaza) [18:03:07] PROBLEM - Cirrus Update lag check - codfw on icinga2001 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&from=now-7d&to=now [18:03:35] (03CR) 10Jforrester: [C: 03+2] Set up exceptions for rollback confirmation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494270 (https://phabricator.wikimedia.org/T217436) (owner: 10Tim Eulitz) [18:04:33] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10RobH) a:05Papaul→03faidon We don't automatically throw away out of warranty systems, so it is really up to @faidon if we decommission and dispose of th... [18:04:53] PROBLEM - Cirrus Update lag check - eqiad on icinga2001 is CRITICAL: (null) https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&from=now-7d&to=now [18:05:07] (03Merged) 10jenkins-bot: Set up exceptions for rollback confirmation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494270 (https://phabricator.wikimedia.org/T217436) (owner: 10Tim Eulitz) [18:05:16] dcausse, ebernhardson ---^ [18:05:27] !log mforns@deploy1001 Finished deploy [analytics/aqs/deploy@13203f1]: Deploying AQS for node10 upgrade (duration: 19m 40s) [18:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:35] tim_WMDE: Now live on mwdebug1002, please check. [18:05:42] also gehel [18:05:48] is it related to some ongoing wokr? [18:05:50] *work? 
[18:05:57] (03PS1) 10Andrew Bogott: openldap profile: remove profile::openldap::secondary_hostname arg [puppet] - 10https://gerrit.wikimedia.org/r/496502 [18:05:59] (03PS1) 10Andrew Bogott: openldap: make ldap-eqiad-replica01/02 ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496503 (https://phabricator.wikimedia.org/T46722) [18:06:25] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10Papaul) Thanks. [18:06:26] elukey: probably related to the recently merged problematic check [18:06:32] 10Operations, 10Data-Services, 10Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083 (10Bstorm) [18:06:32] elukey: checking [18:06:51] ack thanks! [18:07:02] elukey: Oh yeah, that's the second new check, looks like both are failing [18:07:17] (03CR) 10jerkins-bot: [V: 04-1] openldap: make ldap-eqiad-replica01/02 ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496503 (https://phabricator.wikimedia.org/T46722) (owner: 10Andrew Bogott) [18:08:27] tim_WMDE: Looks OK to me, if you are content I will sync? [18:08:28] !log change email for KStineRowe (WMF) on officewiki, collabwiki, SUL [18:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:14] (03CR) 10Filippo Giunchedi: "See inline, lgtm overall" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/495962 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [18:09:58] (03PS2) 10Andrew Bogott: openldap profile: remove profile::openldap::secondary_hostname arg [puppet] - 10https://gerrit.wikimedia.org/r/496502 [18:10:21] I'm in [18:10:32] (03PS1) 10Papaul: DNS: remove mgmt DNS name for rdb200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/496504 [18:10:36] davidwbarratt: BTW, should be synching to Beta any time now. 
[18:10:57] thanks @Thiemo_WMDE and @tim_WMDE and sorry about the delay (traffic was too bad for some reason) [18:11:12] Heya, alaa_wmde. [18:11:28] hey @James_F [18:11:31] (03CR) 10Andrew Bogott: [C: 03+2] openldap profile: remove profile::openldap::secondary_hostname arg [puppet] - 10https://gerrit.wikimedia.org/r/496502 (owner: 10Andrew Bogott) [18:11:42] I'll get to your patch once I've done with tim_WMDE's two. [18:11:50] Unless you want me to rush it now? [18:11:55] tha's great thanks a lot [18:11:55] twentyafterfour: wrt T217938 is this some sort of sudo apt-get command that we need? [18:11:56] T217938: Cannot access beta cluster db - https://phabricator.wikimedia.org/T217938 [18:12:04] no no need to rush @James_F [18:12:47] OK. [18:12:54] (03CR) 10Jforrester: [C: 03+2] Add default user config for rollback confirmation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495667 (https://phabricator.wikimedia.org/T217436) (owner: 10Tim Eulitz) [18:13:28] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT T217436 Set up exceptions for rollback confirmation (duration: 00m 49s) [18:13:28] James_F Perfect! thanks! 
[18:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:31] T217436: Enable confirmation prompt for rollback by default - https://phabricator.wikimedia.org/T217436 [18:13:34] (03CR) 10jenkins-bot: Enforce 8 char password length requirements for non-privileged users on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496260 (owner: 10Dmaza) [18:13:36] (03CR) 10jenkins-bot: Set up exceptions for rollback confirmation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494270 (https://phabricator.wikimedia.org/T217436) (owner: 10Tim Eulitz) [18:14:00] (03Merged) 10jenkins-bot: Add default user config for rollback confirmation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495667 (https://phabricator.wikimedia.org/T217436) (owner: 10Tim Eulitz) [18:14:05] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10Papaul) [18:14:25] (03CR) 10jenkins-bot: Add default user config for rollback confirmation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495667 (https://phabricator.wikimedia.org/T217436) (owner: 10Tim Eulitz) [18:14:47] tim_WMDE: Second patch is now live on mwdebug1002 (which is the one that actually does something, right?). [18:15:19] (03PS1) 10Elukey: Swap journalnode hosts in Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/496505 [18:15:25] Yeah, thats right [18:15:58] (03PS2) 10Andrew Bogott: openldap: make ldap-eqiad-replica01/02 ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496503 (https://phabricator.wikimedia.org/T46722) [18:16:59] (03PS2) 10Elukey: Swap journalnode hosts in Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/496505 [18:17:03] tim_WMDE: T217436 says dewiki and plwiki but your patch only does dewiki; is that OK? [18:18:23] Yes, the Polish community voted against it and I think we forgot to update the ticket [18:18:31] Oh, OK. 
:-) [18:18:41] I will update it to reflect that [18:18:52] Done. [18:20:10] (03CR) 10Elukey: [C: 03+2] Swap journalnode hosts in Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/496505 (owner: 10Elukey) [18:21:27] (03PS3) 10Andrew Bogott: openldap: make ldap-eqiad-replica01/02 ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496503 (https://phabricator.wikimedia.org/T46722) [18:22:29] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT T217436 Add default user config for rollback confirmation (duration: 00m 48s) [18:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:35] T217436: Enable confirmation prompt for rollback by default - https://phabricator.wikimedia.org/T217436 [18:22:44] (03PS4) 10Andrew Bogott: openldap: make ldap-eqiad-replica01/02 ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496503 (https://phabricator.wikimedia.org/T46722) [18:23:37] RoanKattouw: Should I just push yours out? [18:23:45] OK, alaa_wmde, ready/ [18:23:57] yup [18:23:57] James_F: I would appreciate mwdebug first if you don't mind [18:24:08] absolutely! [18:24:10] RoanKattouw: No problem, it's already there (mwdebug1002). 
[18:24:16] OK looking [18:24:17] (03PS3) 10Jforrester: Enable musical notation datatype in wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493011 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [18:24:22] (03CR) 10Jforrester: [C: 03+2] Enable musical notation datatype in wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493011 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [18:24:32] (03PS1) 10Mathew.onipe: icinga: modify cirrus prometheus checks threshold [puppet] - 10https://gerrit.wikimedia.org/r/496509 [18:24:35] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:25:51] (03Merged) 10jenkins-bot: Enable musical notation datatype in wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493011 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [18:26:04] (03CR) 10jenkins-bot: Enable musical notation datatype in wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493011 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [18:26:06] (03PS2) 10Mathew.onipe: icinga: modify cirrus prometheus checks threshold [puppet] - 10https://gerrit.wikimedia.org/r/496509 [18:26:49] alaa_wmde: Live on mwdebug1002. Please check. [18:26:55] checking [18:27:28] James_F: Looks good, ship it [18:27:34] RoanKattouw: Doing so. 
[18:27:54] 10Operations, 10ops-codfw, 10DBA: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) [18:28:14] 10Operations, 10ops-codfw, 10DBA: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) [18:28:36] (03PS5) 10Tchanders: Enforce 8 char password length requirements for non-privileged users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496202 (https://phabricator.wikimedia.org/T211622) (owner: 10Dmaza) [18:28:38] (03PS1) 10Tchanders: Add 'suggestChangeOnLogin' flag to password policies for privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496515 [18:29:25] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:29:26] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.21/extensions/GrowthExperiments/modules/help/: SWAT Ib13cf88d GrowthExperiments log fix for closes (duration: 00m 49s) [18:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:36] James_F: Looks good on mwdebug1002 [18:30:02] alaa_wmde: Excellent! Let's ship it. [18:30:48] yes!! [18:31:38] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT T216730 Enable musical notation datatype on Wikidata (duration: 00m 48s) [18:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:41] T216730: Enable musical notation datatype - https://phabricator.wikimedia.org/T216730 [18:31:44] OK, there's nothing unmerged on https://gerrit.wikimedia.org/r/q/branch:wmf%252F1.33.0-wmf.21 or on https://gerrit.wikimedia.org/r/q/hashtag:swat-2019-03-14-morning. Last call for the SWAT window or I'll declare it closed (and use it for my own purposes). 
[18:32:24] 10Operations, 10ops-codfw, 10DBA: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) @jcrespo @Marostegui - The last db server in codfw is db2096. Can you please replace the "name of the host" if we are going to use something else then db209... [18:33:20] 10Operations, 10DC-Ops, 10hardware-requests: Request spare systems to test ipmi password reset cookbook - https://phabricator.wikimedia.org/T218117 (10RobH) a:05RobH→03Cmjohnson Ok, wmf4660 was restbase1005 back in the day. As such, it has NO SSDs any longer (they were migrated) and has no data storage... [18:34:40] 10Operations, 10DC-Ops, 10hardware-requests: Request spare systems to test ipmi password reset cookbook - https://phabricator.wikimedia.org/T218117 (10RobH) Trying to ping or ssh to wmf4660.mgmt.eqiad.wmnet fails, chris will need to investigate and bring mgmt back online. [18:35:26] 10Operations, 10DC-Ops, 10hardware-requests: Request spare systems to test ipmi password reset cookbook - https://phabricator.wikimedia.org/T218117 (10RobH) Also, I've skipped asking for @faidon's permission to use this spare server, since it is a few things: 1) temp allocation just for ipmi scripting 2) out... [18:37:17] !log set protocols bgp group Anycast4 multihop ttl 190 on cr1-codfw - T209989 [18:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:20] T209989: Bird multihop BFD - https://phabricator.wikimedia.org/T209989 [18:42:57] James_F: can I add one more patch to SWAT? [18:43:10] kostajh: Sure. [18:44:05] James_F: thanks, just a minute [18:44:10] Sure. [18:44:49] gerrit is not letting me create the cherry pick, doing from the CLI now [18:45:27] merge conflict? That's now supported :). 
[18:47:10] 10Operations, 10ops-codfw, 10DBA: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10jcrespo) @Papaul Please see my warnings at T216137#5002854 for Chris, which applys here. I had suggested to use `dbstore` for these hosts, but @Marostegui didn't agr... [18:47:10] paladox: in what way? [18:47:22] anyway, perhaps it's because https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/496521 didn't finish merging yet [18:47:35] kostajh https://www.gerritcodereview.com/2.16.html#new-features-1 [18:47:54] oh. but not (yet) in our gerrit :) [18:48:01] I'm excited for 2.16 for sure [18:48:24] * paladox added that feature [18:48:46] paladox++ [18:49:07] in matter of fact all those listed polygerrit changes were done by me apart from one which i cherry picked. [18:49:33] 10Operations, 10ops-codfw, 10DBA: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Marostegui) Thanks @Papaul! The rack locations are fine I think. The hostname: I think we still need to discuss them as these hosts will not be a normal database (n... 
[18:53:06] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.21/extensions/ParsoidBatchAPI/includes/ApiParsoidBatch.php: SWAT Another deprecation fix via I4936d0ce03 (duration: 00m 49s) [18:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:57] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Nuria) [18:54:50] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Nuria) Stalled on https://phabricator.wikimedia.org/T216528 [18:55:08] (03PS1) 10CRusnov: Port MakeVM to cookbook. [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 [18:55:28] 10Operations, 10DC-Ops, 10hardware-requests: Request spare systems to test ipmi password reset cookbook - https://phabricator.wikimedia.org/T218117 (10RobH) Chris is going to set this back up for use, it was disconnected due to reshuffling of systems. [18:56:33] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/496491 (owner: 10CRusnov) [18:56:44] (03CR) 10jerkins-bot: [V: 04-1] Port MakeVM to cookbook. [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (owner: 10CRusnov) [18:56:53] kostajh: I've got to run, but I can deploy it in half an hour's time? [18:57:19] James_F: OK, sounds good I'll have the patch ready for you [18:57:23] thanks! [18:57:37] 10Operations, 10ops-codfw, 10DBA: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Marostegui) @jcrespo I am still not sure if dbstore would be a good name, just because their hardware is completely different from the existing dbstoreXXXX, but on t... 
[19:00:04] 10Operations, 10netops: Bird multihop BFD - https://phabricator.wikimedia.org/T209989 (10ayounsi) Followed up on the mailing list: > Junos uses the BGP multihop TTL value for BFD as well, and assumes the other side's default TTL is 255. > So if I do: > `lang=diff > [edit protocols bgp group Anycast4 multihop]... [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190314T1900) [19:00:04] thcipriani and paladox: #bothumor I ♥ Unicode. All rise for Gerrit Upgrade deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190314T1900). [19:00:16] * paladox is here. [19:00:25] * thcipriani here [19:01:26] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] Gerrit 2.15.11 release [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/495960 (https://phabricator.wikimedia.org/T214359) (owner: 10Paladox) [19:02:31] !log set protocols bgp group Anycast4 multihop ttl 193 on cr1/2-eqiad - T209989 [19:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:35] T209989: Bird multihop BFD - https://phabricator.wikimedia.org/T209989 [19:03:12] RECOVERY - NFS on labstore1006 is OK: TCP OK - 0.036 second response time on 208.80.154.7 port 2049 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [19:03:38] James_F: just registered it on wikitech https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/496529, whenever you're back. thank you! [19:03:41] 10Operations, 10ops-codfw, 10DBA: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) p:05Triage→03Normal [19:03:57] prepping gerrit 2.15.11 on deployment server, will ensure deployment looks normal on gerrit2001 then go to cobalt then restart [19:05:06] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Minor bugfixes to netbox sync. 
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/496491 (owner: 10CRusnov) [19:05:48] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@2bc8af0]: Gerrit to 2.15.11 on gerrit2001 only [19:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:49] 10Operations, 10netops: Bird multihop BFD - https://phabricator.wikimedia.org/T209989 (10ayounsi) 05Open→03Resolved All done here! [19:05:59] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@2bc8af0]: Gerrit to 2.15.11 on gerrit2001 only (duration: 00m 11s) [19:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:43] gerrit2001 looks normal, going on to cobalt followed by (hopefully) quick gerrit restart [19:06:45] 10Operations, 10ops-codfw, 10DBA: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) Thanks guys. Hopefully I will get the information needed before receiving the servers on 03/22/19. 
[19:07:47] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@2bc8af0]: Gerrit to 2.15.11 on cobalt [19:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:58] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@2bc8af0]: Gerrit to 2.15.11 on cobalt (duration: 00m 11s) [19:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:41] !log restarting gerrit on cobalt for 2.15.11 upgrade [19:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:52] :) [19:10:45] ah [19:11:10] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1529 too small - 1529 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [19:11:59] (03CR) 10Ayounsi: [C: 03+2] Icinga, assign bfd check to routers [puppet] - 10https://gerrit.wikimedia.org/r/461498 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [19:12:14] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas] [19:12:20] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26025 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [19:12:33] !log gerrit back up [19:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:47] (03CR) 10Ayounsi: [C: 03+2] Icinga, assign bfd check to routers [puppet] - 10https://gerrit.wikimedia.org/r/461498 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [19:14:03] (03PS5) 10Andrew Bogott: openldap: make ldap-eqiad-replica01/02 ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496503 (https://phabricator.wikimedia.org/T46722) [19:14:05] (03PS1) 10Andrew Bogott: openldap: add read_only switch for ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/496552 (https://phabricator.wikimedia.org/T46722) [19:14:58] PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [19:16:10] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:16:58] ^ I would guess they were running during gerrit restart :\ [19:18:35] yeah, it happens [19:18:43] (03PS1) 10Ottomata: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) [19:18:46] (03PS3) 10Ayounsi: Icinga, assign bfd check to routers [puppet] - 10https://gerrit.wikimedia.org/r/461498 (https://phabricator.wikimedia.org/T83992) [19:20:22] (03PS2) 10Ottomata: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) [19:20:53] (03CR) 10Jcrespo: [C: 04-1] "Pending applying volans' and marostegui suggestions any maybe more (purging schedule, specially on current hosts?)." [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [19:20:55] thcipriani: gerrit fails for me :/ Received disconnect from 2620:0:861:3:208:80:154:85 port 29418:2: Session has timed out waiting for authentication after 60000 ms. 
[19:21:02] (03PS1) 10Bstorm: Revert "dumps distribution: remove labstore1006 for failover" [puppet] - 10https://gerrit.wikimedia.org/r/496555 [19:21:33] (03CR) 10jerkins-bot: [V: 04-1] Revert "dumps distribution: remove labstore1006 for failover" [puppet] - 10https://gerrit.wikimedia.org/r/496555 (owner: 10Bstorm) [19:21:46] IPv4 is fine, IPv6 is not [19:22:01] oh [19:22:25] tcp6 0 0 2620:0:861:3:208::29418 :::* LISTEN 17508/java [19:22:25] tcp6 0 0 208.80.154.85:29418 :::* LISTEN 17508/java [19:22:28] but it listens on both [19:22:33] hmm https://github.com/wikimedia/puppet/blob/production/modules/gerrit/templates/gerrit.config.erb#L216 [19:22:59] (03PS23) 10Jcrespo: mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) [19:23:04] (03CR) 10Aezell: [C: 03+1] "Untested, but it looks correct." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496515 (owner: 10Tchanders) [19:23:16] (03CR) 10Jcrespo: [C: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [19:23:36] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15136/icinga2001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/461498 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [19:24:05] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts [puppet] - 10https://gerrit.wikimedia.org/r/494899 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [19:25:01] hashar: what command are you trying? git clone? [19:25:04] (03CR) 10Jcrespo: [C: 03+1] "> Doubtful that I have the cycles for this right now however. 
Maybe" [puppet] - 10https://gerrit.wikimedia.org/r/464601 (owner: 10Jcrespo) [19:25:07] and of course I see traffic on the ipv6 29418 port :(((( [19:25:11] ssh [19:25:38] hmm [19:25:48] (03PS1) 10EBernhardson: Update apifeatureusage es template to match 5.6.x+ [puppet] - 10https://gerrit.wikimedia.org/r/496557 (https://phabricator.wikimedia.org/T183156) [19:25:53] actually it manages to connect just fine [19:25:59] !log merged Juniper BFD Icinga check [19:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:03] debug3: send packet: type 30 [19:26:03] debug1: sending SSH2_MSG_KEX_ECDH_INIT [19:26:03] debug1: expecting SSH2_MSG_KEX_ECDH_REPLY [19:26:06] and then stall [19:26:17] (03PS2) 10Andrew Bogott: openldap: add read_only switch for ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/496552 (https://phabricator.wikimedia.org/T46722) [19:26:19] (03PS6) 10Andrew Bogott: openldap: make ldap-eqiad-replica01/02 ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496503 (https://phabricator.wikimedia.org/T46722) [19:28:40] hashar: is this a new problem? do I need to rollback? [19:28:43] works for me [19:28:51] really? 
[19:29:06] (03PS3) 10Andrew Bogott: openldap: add read_only switch for ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/496552 (https://phabricator.wikimedia.org/T46722) [19:29:08] (03PS7) 10Andrew Bogott: openldap: make ldap-eqiad-replica01/02 ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496503 (https://phabricator.wikimedia.org/T46722) [19:29:14] yup [19:29:20] i git pull'ed on the puppet repo [19:29:25] which i use ssh [19:29:50] sure, but I haven't been able to use: ssh -6 to issue any gerrit commands [19:30:02] oh /me try's that [19:30:46] so yeah ssh -6 is borked somehow :( [19:31:13] (03CR) 10Andrew Bogott: [C: 03+2] openldap: add read_only switch for ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/496552 (https://phabricator.wikimedia.org/T46722) (owner: 10Andrew Bogott) [19:31:13] thcipriani works for me [19:31:27] paladox: for which command? [19:31:38] thcipriani https://phabricator.wikimedia.org/P8202 [19:32:10] tcp6 0 0 2620:0:861:3:208::29418 :::* LISTEN 17508/java [19:32:13] that is from netstat [19:32:16] ACKNOWLEDGEMENT - BFD status on cr1-esams is CRITICAL: CRIT: Down: 2 Ayounsi https://phabricator.wikimedia.org/T209989 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:32:16] ACKNOWLEDGEMENT - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 Ayounsi https://phabricator.wikimedia.org/T209989 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:32:43] which looks wrong to me [19:32:57] then maybe netstat is broken itself hehe [19:34:36] hashar: it looks like it's not listening on the correct IP [19:34:56] yeah that is what I thought. But I guess netstat just strip the field [19:35:03] lsof is fine [19:35:09] and I can connect to the tcp port at least [19:35:09] https://github.com/wikimedia/puppet/blob/production/hieradata/role/eqiad/gerrit.yaml#L3 [19:35:40] hashar: ok! 
[19:36:52] indeed --wide / -W shows the proper IP [19:37:11] !log set protocols bgp group Anycast4 multihop ttl 193 on cr1/2-esams - T209989 [19:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:14] T209989: Bird multihop BFD - https://phabricator.wikimedia.org/T209989 [19:37:50] ironically jenkins-bot is able to connect over ipv6 :/ [19:38:54] as is paladox it would seem [19:39:03] (03PS2) 10Bstorm: Revert "dumps distribution: remove labstore1006 for failover" [puppet] - 10https://gerrit.wikimedia.org/r/496555 [19:39:42] (03CR) 10jerkins-bot: [V: 04-1] Revert "dumps distribution: remove labstore1006 for failover" [puppet] - 10https://gerrit.wikimedia.org/r/496555 (owner: 10Bstorm) [19:40:25] so [19:40:29] RECOVERY - puppet last run on webperf1002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [19:40:36] from contint1001 I can ssh with ipv6 just fine [19:41:19] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga2001 is OK: (C)130 ge (W)110 ge 62.97 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [19:41:41] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:41:43] (03CR) 10Ppchelko: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [19:41:54] James_F: no rush, just checking to see if you want to do the SWAT now [19:42:10] I could move it to the later window if needed [19:42:14] thcipriani: I dont get it :( [19:42:33] i see others being able to ssh to gerrit over ipv6 :/ [19:42:37] but [19:42:40] hashar: could you do this with 2.15.8 and now not to 2.15.11? [19:42:51] kostajh: I'm around. 
[19:42:57] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:43:03] yeah why would it suddenly stop the minute Gerrit got upgrade ? :( [19:43:04] kostajh: Sorry, trying to work out WTF is going on with a deprecation warning. :-) [19:43:16] (03PS3) 10Bstorm: Revert "dumps distribution: remove labstore1006 for failover" [puppet] - 10https://gerrit.wikimedia.org/r/496555 [19:44:29] Fix authentication for LFS over SSH. [19:44:30] 10Operations, 10monitoring, 10netops, 10Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992 (10ayounsi) [19:44:35] that is the only thing mentionning ssh :/ [19:45:02] kostajh: Want me to deploy? [19:45:11] James_F: yes please [19:45:51] kostajh: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/496529 ? [19:46:36] (03PS4) 10Bstorm: Revert "dumps distribution: remove labstore1006 for failover" [puppet] - 10https://gerrit.wikimedia.org/r/496555 [19:46:38] 10Operations, 10DBA, 10Availability (MediaWiki-MultiDC), 10Core Platform Team Backlog (Watching / External), and 2 others: Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672 (10mobrovac) Would the next step here be puppetising the generation/disseminat... 
[19:47:42] (03CR) 10Ottomata: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [19:49:42] (03CR) 10Ppchelko: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [19:49:53] (03CR) 10Ppchelko: [C: 03+1] eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [19:51:45] James_F: yes [19:52:14] * James_F twiddles thumbs waiting for CI. [19:52:50] (03PS6) 10BryanDavis: Make Kubernetes the default backend and warn when guessing [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: 10Nehajha) [19:52:53] (03PS1) 10BryanDavis: kubernetes: Set php7.2 as the default type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496564 (https://phabricator.wikimedia.org/T188318) [19:53:05] (03PS1) 10BryanDavis: Report error messages on stderr [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496565 [19:53:05] (03PS1) 10BryanDavis: Remove lighttpd-precise handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496566 [19:53:07] (03PS1) 10BryanDavis: Improve support for extra_args [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496567 [19:55:50] (03CR) 10Bstorm: [C: 03+2] Revert "dumps distribution: remove labstore1006 for failover" [puppet] - 10https://gerrit.wikimedia.org/r/496555 (owner: 10Bstorm) [19:57:14] thcipriani: so yeah ssh to Gerrit 29418 is broken for me over Ipv6 [19:57:32] the KEX init is received [19:57:47] my client send the init and expect a reply 
which has the server host key [19:57:56] but I receive nothing :/ [19:58:54] hashar: have you watched the gerrit error logs while you do this? anything there? [19:58:59] yeah [19:59:00] nothing [19:59:08] would need to enable bunch of debug logging for ssh [19:59:11] I am tired [19:59:14] ;( [19:59:16] computer sucks [20:01:56] hashar: I can't test this unfortunately :\ the fact that others are able to connect and there's nothing in the error logs makes me relucant to rollback. [20:03:06] oh [20:03:42] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.21/extensions/GrowthExperiments/extension.json: Hot-deploy I19414dc31 to fix dependencies on mw.Uri (duration: 00m 49s) [20:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:12] kostajh: All done. [20:04:31] James_F: cheers [20:04:50] hashar: do you think this is indicative of a wider problem? [20:09:41] thcipriani: sorry had some other issue @home [20:10:09] !log crusnov@deploy1001 Started deploy [netbox/deploy@c6cf7d6]: Minor bugfix releaes for ganeti-netbox script [20:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:26] it worked at 16:00 UTC [20:11:03] !log crusnov@deploy1001 Finished deploy [netbox/deploy@c6cf7d6]: Minor bugfix releaes for ganeti-netbox script (duration: 00m 54s) [20:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:20] thcipriani: I would rather rollback to play it safe [20:11:27] I don't have any indication it comes from my machine [20:11:34] hashar: k [20:11:36] * thcipriani does [20:11:37] ;(((( [20:11:59] guess will have to reproduce at home :/ [20:12:19] (03PS1) 10Bstorm: Revert "dumps distribution: swap do_acme for dumps server failover" [puppet] - 10https://gerrit.wikimedia.org/r/496576 [20:12:51] (03PS2) 10Bstorm: Revert "dumps distribution: swap do_acme for dumps server failover" [puppet] - 10https://gerrit.wikimedia.org/r/496576 [20:12:53] and if it doesn't work on 
the old version... then I am doomed [20:13:09] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10Pchelolo) a:03holger.knust Verified that we can work with swagger-ui 3+ once we make the spec standard-compliant. Let's begi... [20:13:16] (03PS1) 10Thcipriani: Revert "Gerrit 2.15.11 release" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/496578 [20:13:23] (03PS3) 10Jforrester: Added a setting to define Wikibase entity types that have no RDF output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490586 (https://phabricator.wikimedia.org/T213483) (owner: 10WMDE-leszek) [20:13:42] !log Placed labstore1006 back in rotation for NFS and rsync [20:13:43] (03CR) 10Andrew Bogott: [C: 03+2] openldap: make ldap-eqiad-replica01/02 ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496503 (https://phabricator.wikimedia.org/T46722) (owner: 10Andrew Bogott) [20:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:53] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@2bc8af0]: Revert Gerrit to 2.15.11 on gerrit2001 only [20:13:53] (03PS8) 10Andrew Bogott: openldap: make ldap-eqiad-replica01/02 ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496503 (https://phabricator.wikimedia.org/T46722) [20:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:55] (03CR) 10Jforrester: "Todo once wmf.22 is everywhere." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/490586 (https://phabricator.wikimedia.org/T213483) (owner: 10WMDE-leszek) [20:14:03] (03PS3) 10Jforrester: Disable RDF output of mediainfo Wikibase entities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490587 (https://phabricator.wikimedia.org/T213483) (owner: 10WMDE-leszek) [20:14:04] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@2bc8af0]: Revert Gerrit to 2.15.11 on gerrit2001 only (duration: 00m 10s) [20:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:43] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@2bc8af0]: Revert Gerrit to 2.15.11 on cobalt [20:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:50] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@2bc8af0]: Revert Gerrit to 2.15.11 on cobalt (duration: 00m 07s) [20:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:07] (03PS1) 10CRusnov: Fix typo in urljoin function name. [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/496580 [20:15:26] !log restart gerrit on cobalt [20:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:10] !log gerrit back to 2.15.8 [20:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:33] hashar: can you confirm that ssh ipv6 is working for you again? [20:17:43] same, it is still broken [20:17:46] well enough [20:17:52] I am sending my resignation letter [20:17:57] and apply for early retirement [20:18:04] ugh computers. [20:18:09] (03PS3) 10Herron: logstash: send mediawiki syslogs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/495962 (https://phabricator.wikimedia.org/T213899) [20:18:16] I am done with keyboards internet and computer it is all broken [20:18:32] you have a beef with keyboards? 
[20:18:38] really [20:18:41] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build],Exec[git_pull_mediawiki/event-schemas] [20:18:55] I blame some weird MTU with IPv6, that is all what is left [20:19:10] or some wrong ttl somewhere in the path?!::\ [20:19:41] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas] [20:20:11] thcipriani: sorry I have made you rollback gerrit for nothing :\ [20:20:40] hashar: better safe than sorry. I am going to abandon my revert and roll forward again, sound sane? [20:20:50] +1 ! [20:21:10] (03Abandoned) 10Thcipriani: Revert "Gerrit 2.15.11 release" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/496578 (owner: 10Thcipriani) [20:22:37] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@2bc8af0]: Gerrit to 2.15.11 on gerrit2001 only [20:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:42] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@2bc8af0]: Gerrit to 2.15.11 on gerrit2001 only (duration: 00m 04s) [20:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:35] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@2bc8af0]: Gerrit to 2.15.11 on cobalt [20:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:38] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@2bc8af0]: Gerrit to 2.15.11 on cobalt (duration: 00m 02s) [20:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:43] PROBLEM - puppet last run on webperf2002 is CRITICAL: 
CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [20:23:53] (03CR) 10Herron: logstash: send mediawiki syslogs to logging pipeline (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/495962 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [20:24:05] !log restarting gerrit for 2.15.11 [20:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:00] !log gerrit live on 2.15.11 [20:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:33] (03PS1) 10CRusnov: Fix ganeti-netbox sync profiles to refer to proper DC. [puppet] - 10https://gerrit.wikimedia.org/r/496606 [20:27:21] (03PS1) 10Bstorm: dumps distribution: set dumps ttl to 5m to prep for failback [dns] - 10https://gerrit.wikimedia.org/r/496607 (https://phabricator.wikimedia.org/T217473) [20:28:07] thcipriani: thank you! [20:28:30] hashar: thanks for watching/checking deployment. I appreciate it :) [20:31:01] (03CR) 10Bstorm: [C: 03+2] dumps distribution: set dumps ttl to 5m to prep for failback [dns] - 10https://gerrit.wikimedia.org/r/496607 (https://phabricator.wikimedia.org/T217473) (owner: 10Bstorm) [20:33:53] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [20:35:01] 10Operations, 10Operations-Software-Development, 10serviceops, 10User-Joe, and 2 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10crusnov) a:03crusnov [20:35:53] (03PS2) 10CRusnov: Port MakeVM to cookbook. [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) [20:37:38] (03CR) 10jerkins-bot: [V: 04-1] Port MakeVM to cookbook. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [20:39:57] (03PS4) 10Herron: logstash: send mediawiki syslogs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/495962 (https://phabricator.wikimedia.org/T213899) [20:40:34] (03PS1) 10Bstorm: dumps distribution: fail dumps back to labstore1006 [dns] - 10https://gerrit.wikimedia.org/r/496614 (https://phabricator.wikimedia.org/T217473) [20:41:31] hashar maybe reboot your computer? [20:41:41] (03CR) 10Herron: [C: 03+1] Fix ganeti-netbox sync profiles to refer to proper DC. [puppet] - 10https://gerrit.wikimedia.org/r/496606 (owner: 10CRusnov) [20:43:19] (03CR) 10Herron: [C: 03+2] logstash: send mediawiki syslogs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/495962 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [20:45:16] (03PS1) 10Andrew Bogott: ldap: added certs for ldap-ro [puppet] - 10https://gerrit.wikimedia.org/r/496615 (https://phabricator.wikimedia.org/T46722) [20:46:43] (03CR) 10CRusnov: Port MakeVM to cookbook. (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [20:47:04] (03PS3) 10CRusnov: Port MakeVM to cookbook. [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) [20:47:31] (03CR) 10Andrew Bogott: [C: 03+2] ldap: added certs for ldap-ro [puppet] - 10https://gerrit.wikimedia.org/r/496615 (https://phabricator.wikimedia.org/T46722) (owner: 10Andrew Bogott) [20:47:35] (03PS2) 10CRusnov: Fix ganeti-netbox sync profiles to refer to proper DC. [puppet] - 10https://gerrit.wikimedia.org/r/496606 [20:48:29] (03CR) 10CRusnov: [C: 03+2] Fix ganeti-netbox sync profiles to refer to proper DC. [puppet] - 10https://gerrit.wikimedia.org/r/496606 (owner: 10CRusnov) [20:48:56] (03PS3) 10CRusnov: Fix ganeti-netbox sync profiles to refer to proper DC. 
[puppet] - 10https://gerrit.wikimedia.org/r/496606 [20:49:10] (03CR) 10jerkins-bot: [V: 04-1] Port MakeVM to cookbook. [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [20:49:31] RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:50:29] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [20:55:21] (03CR) 10Herron: [C: 03+1] Fix typo in urljoin function name. [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/496580 (owner: 10CRusnov) [20:56:21] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Fix typo in urljoin function name. [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/496580 (owner: 10CRusnov) [20:59:23] paladox: ;))))))))))))))))))))) [20:59:31] that worked? :) [21:00:33] !log crusnov@deploy1001 Started deploy [netbox/deploy@090a0c3]: Another minor bugfix releaes for ganeti-netbox script [21:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:29] !log crusnov@deploy1001 Finished deploy [netbox/deploy@090a0c3]: Another minor bugfix releaes for ganeti-netbox script (duration: 00m 56s) [21:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:33] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:06:29] burp, probably related to poking it constantly, should be good now. 
[21:06:57] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [21:09:59] (03PS6) 10Herron: logstash: add udp json logback localhost compatibility endpoint [puppet] - 10https://gerrit.wikimedia.org/r/496022 (https://phabricator.wikimedia.org/T213899) [21:17:58] 10Operations, 10VisualEditor, 10Readers-Web-Backlog (Tracking), 10Wikimedia-production-error: [Bug] Sporadic 503 errors when editing - https://phabricator.wikimedia.org/T218252 (10Framawiki) That was a (huge) temporary error: https://grafana.wikimedia.org/d/000000503/varnish-http-errors?orgId=1&from=155249... [21:22:48] 10Operations, 10Discovery, 10Elasticsearch, 10Icinga, and 3 others: Merge http and https elasticsearch icinga checks into one - https://phabricator.wikimedia.org/T215587 (10debt) 05Open→03Resolved [21:27:02] (03CR) 10Gergő Tisza: "Blocked on T218137#5024580." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496210 (https://phabricator.wikimedia.org/T218137) (owner: 10Mholloway) [21:28:13] (03PS1) 10Herron: rsyslog: restart service on kafka_shipper lookup table change [puppet] - 10https://gerrit.wikimedia.org/r/496625 [21:28:51] (03CR) 10Herron: [C: 03+2] logstash: add udp json logback localhost compatibility endpoint [puppet] - 10https://gerrit.wikimedia.org/r/496022 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [21:31:30] (03PS3) 10Ottomata: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) [21:32:38] (03CR) 10Ottomata: "@godog I think I need some help with the summary stuff! 
:)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) (owner: 10Ottomata) [21:43:08] (03PS4) 10Ottomata: eventgate-analytics - Add statsd prometheus mappings for node-rdkafka-statsd metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/496554 (https://phabricator.wikimedia.org/T218305) [21:48:32] PROBLEM - Nginx local proxy to apache on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:48:42] PROBLEM - Apache HTTP on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:49:32] RECOVERY - Nginx local proxy to apache on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.242 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:49:42] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:58:56] !log cdanis@icinga2001.wikimedia.org ~ % sudo systemctl restart nsca.service [21:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:10] !log cdanis@icinga2001.wikimedia.org ~ % sudo systemctl restart icinga.service [22:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:22] (03PS1) 10Varnent: Add form for Movement communications group signup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496677 [22:06:11] (03CR) 10jerkins-bot: [V: 04-1] Add form for Movement communications group signup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496677 (owner: 10Varnent) [22:08:39] (03PS2) 10Varnent: Add form for Movement communications group signup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496677 
(https://phabricator.wikimedia.org/T218363) [22:09:33] (03CR) 10jerkins-bot: [V: 04-1] Add form for Movement communications group signup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496677 (https://phabricator.wikimedia.org/T218363) (owner: 10Varnent) [22:12:42] (03PS3) 10Varnent: Add form for Movement communications group signup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496677 (https://phabricator.wikimedia.org/T218363) [22:13:34] (03CR) 10jerkins-bot: [V: 04-1] Add form for Movement communications group signup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496677 (https://phabricator.wikimedia.org/T218363) (owner: 10Varnent) [22:14:58] (03PS4) 10Varnent: Add form for Movement communications group signup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496677 (https://phabricator.wikimedia.org/T218363) [22:16:47] (03CR) 10jerkins-bot: [V: 04-1] Add form for Movement communications group signup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496677 (https://phabricator.wikimedia.org/T218363) (owner: 10Varnent) [22:18:57] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[22:19:17] PROBLEM - logstash process on logstash1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (logstash), command name java, args logstash [22:19:41] hurg downtime expired [22:19:57] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational [22:20:19] RECOVERY - logstash process on logstash1007 is OK: PROCS OK: 1 process with UID = 498 (logstash), command name java, args logstash [22:22:22] (03PS5) 10Jforrester: Add form for Movement communications group signup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496677 (https://phabricator.wikimedia.org/T218363) (owner: 10Varnent) [22:25:50] 10Operations, 10Release Pipeline, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10dduvall) I'm pushing back on the patchset to Blubber for a couple... [22:28:43] 10Operations, 10Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10Varnent) [22:28:58] 10Operations, 10Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10Varnent) p:05Triage→03High [22:35:59] (03PS1) 10BryanDavis: toolforge: Cleanup host_aliases and exim4 conf for Trusty grid [puppet] - 10https://gerrit.wikimedia.org/r/496680 (https://phabricator.wikimedia.org/T109485) [22:49:28] 10Operations, 10Release Pipeline, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10mobrovac) How about having a semi-config stanza in `blubber.yaml`... [22:49:30] (03PS13) 10CRusnov: Add system timer for running ganeti->netbox sync. 
[puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190314T2300). Please do the needful. [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:46:59] 10Operations, 10Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10Varnent) Actually, this will essentially be a replacement for the existing ComCom list. Perhaps for archive preservation it would be better to rename that list?...