[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181221T0000). [00:00:04] Jdlrobson and MR70: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:45] \o [00:02:47] (03PS1) 10Urbanecm: Add http://mbc.cyfrowemazowsze.pl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481108 (https://phabricator.wikimedia.org/T212469) [00:02:49] (03PS2) 10Urbanecm: Add http://mbc.cyfrowemazowsze.pl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481108 (https://phabricator.wikimedia.org/T212469) [00:06:55] anyone around to do one last swat before xmas? [00:08:10] -_-, Looks like there's no one [00:10:06] twentyafterfour Niharika around? [00:10:21] or MaxSem maybe? [00:10:25] jdlrobson: yo [00:10:46] twentyafterfour: are you able to do a couple of swats to see us into xmas? [00:12:46] jdlrobson: sure [00:13:32] <3 [00:13:35] MR70: ^ [00:17:44] twentyafterfour: When we'll start? 
[00:17:49] (03PS4) 1020after4: Restore access to Special:Stabilization settings in cawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481046 (https://phabricator.wikimedia.org/T212315) (owner: 10MR70) [00:18:43] MR70: I'll merge yours first since the other patch will take longer to run tests [00:18:55] (03CR) 1020after4: [C: 03+2] Restore access to Special:Stabilization settings in cawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481046 (https://phabricator.wikimedia.org/T212315) (owner: 10MR70) [00:19:09] ok [00:19:19] Thanks [00:20:03] (03Merged) 10jenkins-bot: Restore access to Special:Stabilization settings in cawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481046 (https://phabricator.wikimedia.org/T212315) (owner: 10MR70) [00:20:50] jdlrobson: looks like yours is failing tests [00:20:58] o_O [00:21:35] https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-php71-docker/1446/ [00:21:52] it shouldn't be... [00:23:04] hmm it seems like it's unrelated to the patch [00:23:29] why are we running npm to test a php change [00:24:27] twentyafterfour, because everything can break anything [00:25:51] (03CR) 10jenkins-bot: Restore access to Special:Stabilization settings in cawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481046 (https://phabricator.wikimedia.org/T212315) (owner: 10MR70) [00:26:07] MR70: can you test your change on mwdebug1001? [00:26:11] it should be live [00:26:17] ok i'll [00:26:47] Yeah it works [00:27:52] thanks, syncing it everywhere [00:41:20] gate-and-submit-swat looking a bit more promising 2nd time round.. [00:41:29] (touch wood) [00:44:05] jdlrobson: indeed, it merged [00:44:47] w00t [00:45:12] (03PS1) 10Cwhite: mediawiki: enable statsd_exporter and add matching rules to appserver [puppet] - 10https://gerrit.wikimedia.org/r/481110 (https://phabricator.wikimedia.org/T205870) [00:46:05] jdlrobson: it should be live on mwdebug1001, care to test? 
[00:46:43] yu[ [00:47:17] testing complete! [00:47:20] twentyafterfour: you can sync! [00:50:21] syncing [00:50:59] !log twentyafterfour@deploy1001 Synchronized php-1.33.0-wmf.9/extensions/MobileFrontend/: SWAT: sync https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/481026 (duration: 00m 48s) [00:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:52:09] !log SWAT Finished. See you all next year! [00:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:52:48] w00t thanks twentyafterfour for stepping up [00:56:08] Thanks twentyafterfour. Happy New Year! [01:05:06] (03PS10) 10Paladox: phabricator: Fix loading of php-extensions [puppet] - 10https://gerrit.wikimedia.org/r/479909 [02:40:51] RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad%2520prometheus%252Fops [02:54:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:55:21] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:55:37] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [03:01:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [03:30:42] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Krinkle) [03:36:53] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 916.09 seconds [04:16:22] (03PS1) 10Gergő Tisza: Make password policy code saner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481115 [04:16:24] (03PS1) 10Gergő Tisza: Remove unnecessary exception handling from wfGetPrivilegedGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481116 [04:19:43] (03PS2) 10Gergő Tisza: Make password policy code saner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481115 [04:19:45] (03PS2) 10Gergő Tisza: Remove unnecessary exception handling from wfGetPrivilegedGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481116 [04:59:07] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 246.09 seconds [05:02:35] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [05:09:49] 
RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [05:52:37] (03PS4) 10Giuseppe Lavagetto: role::beta: introduce docker_services [puppet] - 10https://gerrit.wikimedia.org/r/478637 [05:53:31] (03CR) 10jerkins-bot: [V: 04-1] role::beta: introduce docker_services [puppet] - 10https://gerrit.wikimedia.org/r/478637 (owner: 10Giuseppe Lavagetto) [05:55:39] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::php::monitoring: fine-grained opcache invalidation [puppet] - 10https://gerrit.wikimedia.org/r/480714 (https://phabricator.wikimedia.org/T211964) [05:58:40] > !log SWAT Finished. See you all next year! [05:58:40] wow, the last SWAT deployment in 2018? [05:59:48] (03PS5) 10Giuseppe Lavagetto: role::beta: introduce docker_services [puppet] - 10https://gerrit.wikimedia.org/r/478637 [06:00:38] <_joe_> takidelfin: no one wants to get called up during the Sol Invictus festivities (https://en.wikipedia.org/wiki/Sol_Invictus) [06:00:48] (03CR) 10jerkins-bot: [V: 04-1] role::beta: introduce docker_services [puppet] - 10https://gerrit.wikimedia.org/r/478637 (owner: 10Giuseppe Lavagetto) [06:01:04] <_joe_> so yes, next week there will be no deployment [06:01:20] :O [06:02:20] <_joe_> that's pretty clearly stated on the deployments page on wikitech, IIRC [06:05:16] oh, thanks! 
[06:21:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:22:55] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [06:25:17] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [06:25:23] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [06:26:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:28:57] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:29:51] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:30:11] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-intel-microcode] [06:30:53] PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_confd_lint] [06:32:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [06:33:17] PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:33:19] this link seems not right --^ [06:33:28] s/.json// [06:36:05] (03PS1) 10Elukey: role::graphite::alerts::reqstats: fix dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/481117 [06:37:50] <_joe_> elukey: it's all the logrotate problem again [06:38:16] <_joe_> we solved that on a preceding puppet version, has started happening again lately [06:38:36] do you mean the varnish 5xx? [06:38:58] <_joe_> oh sorry, no [06:39:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::mediawiki::php::monitoring: fine-grained opcache invalidation [puppet] - 10https://gerrit.wikimedia.org/r/480714 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [06:39:39] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [06:51:36] (03PS1) 10Marostegui: db-eqiad.php: Remove old comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481118 (https://phabricator.wikimedia.org/T211973) [06:53:06] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Remove old comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481118 (https://phabricator.wikimedia.org/T211973) (owner: 10Marostegui) [06:54:10] (03Merged) 10jenkins-bot: db-eqiad.php: Remove old comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481118 (https://phabricator.wikimedia.org/T211973) (owner: 10Marostegui) [06:54:55] RECOVERY - puppet last run on oresrdb1001 is OK: OK: 
Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:55:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove old comments T211973 (duration: 00m 46s) [06:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:31] T211973: Check GTID, consistency options, notifications across the fleet and db-eqiad.php weights - https://phabricator.wikimedia.org/T211973 [06:55:32] (03CR) 10jenkins-bot: db-eqiad.php: Remove old comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481118 (https://phabricator.wikimedia.org/T211973) (owner: 10Marostegui) [06:56:05] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:56:07] (03PS6) 10Giuseppe Lavagetto: role::beta: introduce docker_services [puppet] - 10https://gerrit.wikimedia.org/r/478637 [06:56:53] RECOVERY - puppet last run on authdns2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:17] RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:34:24] (03CR) 10Mobrovac: "Looking good!" [puppet] - 10https://gerrit.wikimedia.org/r/478637 (owner: 10Giuseppe Lavagetto) [07:40:09] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [07:41:26] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10MoritzMuehlenhoff) There should also be a number of additional test cases in Phab: https://phabricator.wikimedia.org/tag/wikimedia-svg-rendering/ IMHO it m... 
[07:47:25] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [07:55:33] (03PS8) 10Mathew.onipe: cirrus: increase number of shards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) [07:56:21] (03CR) 10jerkins-bot: [V: 04-1] cirrus: increase number of shards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) (owner: 10Mathew.onipe) [07:58:57] (03PS1) 10Elukey: hadoop::directory: use_kerberos parameter must be set for cdh::exec [puppet/cdh] - 10https://gerrit.wikimedia.org/r/481123 [08:00:58] 10Operations, 10ops-codfw, 10DBA: Issues with mgmt interface on es2001 host - https://phabricator.wikimedia.org/T204928 (10Marostegui) 05Open→03Resolved This got fixed by itself - maybe it was fixed with the last reboot? ` ssh es2001.mgmt.codfw.wmnet -lroot root@es2001.mgmt.codfw.wmnet's password: /admin... 
[08:01:48] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14039/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/481123 (owner: 10Elukey) [08:04:13] (03PS9) 10Mathew.onipe: cirrus: increase number of shards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) [08:04:47] (03CR) 10jerkins-bot: [V: 04-1] cirrus: increase number of shards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) (owner: 10Mathew.onipe) [08:05:18] (03PS1) 10Elukey: Update cdh module to latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/481124 [08:05:47] (03PS2) 10Elukey: Update cdh module to latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/481124 [08:07:28] (03CR) 10Elukey: [C: 03+2] Update cdh module to latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/481124 (owner: 10Elukey) [08:19:41] (03PS1) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) [08:23:55] (03CR) 10Hashar: [C: 03+1] Add jenkins-agent user to releases-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/474824 (owner: 10Thcipriani) [08:24:13] (03PS1) 10Elukey: profile::hadoop::spark: use cdh::exec where needed [puppet] - 10https://gerrit.wikimedia.org/r/481127 [08:26:19] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10WMDE-leszek) > the request to index.php is conditionally routed directly to the SSR service. In our world, the SSR service is there, so w... 
[08:28:18] 10Operations: swift-recon-cron - cffi library '_openssl' has no function, constant or global variable named 'sk_H509_NAME]ENTRY_value' - https://phabricator.wikimedia.org/T212439 (10fgiunchedi) [08:28:23] 10Operations, 10ops-codfw, 10User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10fgiunchedi) [08:30:46] (03CR) 10Filippo Giunchedi: "LGTM, mind changing other grafana links that point to .json ? IIRC the only other is swift" [puppet] - 10https://gerrit.wikimedia.org/r/481117 (owner: 10Elukey) [08:38:26] (03CR) 10Mathew.onipe: "I dunno if this solves the problem" [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: 10Mathew.onipe) [08:38:50] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14040/" [puppet] - 10https://gerrit.wikimedia.org/r/481127 (owner: 10Elukey) [08:38:55] (03CR) 10Mathew.onipe: "PCC thinks so: https://puppet-compiler.wmflabs.org/compiler1002/14041/" [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: 10Mathew.onipe) [08:41:04] !log upgrading nginx on debug proxies [08:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:40] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/467723 (https://phabricator.wikimedia.org/T179461) (owner: 10BryanDavis) [08:42:56] (03PS1) 10Elukey: profile::hadoop::backup::namenode: add kerberos wrapper [puppet] - 10https://gerrit.wikimedia.org/r/481128 [08:42:58] (03CR) 10Elukey: "sure!" 
[puppet] - 10https://gerrit.wikimedia.org/r/481117 (owner: 10Elukey) [08:45:15] !log upgrading nginx on sodium [08:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:11] (03PS3) 10Muehlenhoff: Remove obsolete rsync::repo [puppet] - 10https://gerrit.wikimedia.org/r/470611 [08:56:43] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete rsync::repo [puppet] - 10https://gerrit.wikimedia.org/r/470611 (owner: 10Muehlenhoff) [08:57:24] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14042/" [puppet] - 10https://gerrit.wikimedia.org/r/481128 (owner: 10Elukey) [08:57:33] (03PS2) 10Elukey: profile::hadoop::backup::namenode: add kerberos wrapper [puppet] - 10https://gerrit.wikimedia.org/r/481128 [08:57:34] (03CR) 10Elukey: [V: 03+2 C: 03+2] profile::hadoop::backup::namenode: add kerberos wrapper [puppet] - 10https://gerrit.wikimedia.org/r/481128 (owner: 10Elukey) [09:03:53] (03PS1) 10Alexandros Kosiaris: Narrow down ferm etcd allow_from [puppet] - 10https://gerrit.wikimedia.org/r/481132 [09:03:58] (03PS2) 10Elukey: Fix grafana's dashboard links using the .json suffix [puppet] - 10https://gerrit.wikimedia.org/r/481117 [09:04:40] (03CR) 10Alexandros Kosiaris: [C: 03+1] role::beta: introduce docker_services [puppet] - 10https://gerrit.wikimedia.org/r/478637 (owner: 10Giuseppe Lavagetto) [09:05:42] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/480757 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [09:12:05] 10Operations, 10Patch-For-Review, 10User-Elukey: tmpreaper doesn't play along with PrivateTmp systemd units - https://phabricator.wikimedia.org/T185195 (10MoritzMuehlenhoff) tmpreaper::reap doesn't seem to be used at all (at least in production)? I think we could either extend the tmpreaper.conf and pass it... 
[09:12:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] Narrow down ferm etcd allow_from [puppet] - 10https://gerrit.wikimedia.org/r/481132 (owner: 10Alexandros Kosiaris) [09:19:02] (03CR) 10Muehlenhoff: "When making this change consider to also extend the superset.service with PrivateTmp=true. This will mount /tmp to a private namespace (se" [puppet] - 10https://gerrit.wikimedia.org/r/479408 (owner: 10Elukey) [09:22:35] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/481117 (owner: 10Elukey) [09:22:55] !log depool ms-fe2006 to test new TLS certs T212215 [09:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:58] T212215: Update Subject Alternative Name field in TLS certificates for swift - https://phabricator.wikimedia.org/T212215 [09:23:00] (03CR) 10Muehlenhoff: druid: reserve middlemanager ports from 8200 onward (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480733 (https://phabricator.wikimedia.org/T204979) (owner: 10Elukey) [09:23:20] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2006.codfw.wmnet [09:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:50] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/480757 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [09:25:11] (03CR) 10Elukey: [C: 03+2] druid: reserve middlemanager ports from 8200 onward (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480733 (https://phabricator.wikimedia.org/T204979) (owner: 10Elukey) [09:36:13] 10Operations, 10Traffic, 10media-storage: Update Subject Alternative Name field in TLS certificates for swift - https://phabricator.wikimedia.org/T212215 (10ema) Tested the new cert on ms-fe2006, looks good: ` $ echo | openssl s_client -connect ms-fe2006.codfw.wmnet:443 2>&1 | openssl x509 -noout -text | gr... 
[09:44:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:49:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:52:40] (03CR) 10Elukey: [C: 03+2] Fix grafana's dashboard links using the .json suffix [puppet] - 10https://gerrit.wikimedia.org/r/481117 (owner: 10Elukey) [09:52:51] (03PS3) 10Elukey: Fix grafana's dashboard links using the .json suffix [puppet] - 10https://gerrit.wikimedia.org/r/481117 [09:57:46] !log repool ms-fe2006 with old certs, test successful T212215#4839960 [09:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:49] T212215: Update Subject Alternative Name field in TLS certificates for swift - https://phabricator.wikimedia.org/T212215 [09:58:57] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2006.codfw.wmnet [09:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:20] (03CR) 10DCausse: [C: 04-1] elasticsearch: allow cross cluster communication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: 10Mathew.onipe) [10:13:06] (03PS1) 10Ema: swift: new cert for ms-fe.svc.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/481136 (https://phabricator.wikimedia.org/T212215) [10:13:08] (03PS1) 10Ema: swift: new cert for ms-fe.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/481137 (https://phabricator.wikimedia.org/T212215) [10:16:49] (03PS1) 10Elukey: druid::middlemanager: fix argument documentation [puppet] - 10https://gerrit.wikimedia.org/r/481138 [10:17:48] (03CR) 10Elukey: [C: 03+2] druid::middlemanager: fix argument 
documentation [puppet] - 10https://gerrit.wikimedia.org/r/481138 (owner: 10Elukey) [10:28:14] (03PS1) 10Muehlenhoff: Align thumbor.profile and mediawiki-converters.profile [puppet] - 10https://gerrit.wikimedia.org/r/481139 [10:28:16] (03PS1) 10Muehlenhoff: Also limit file size in mediawiki profile [puppet] - 10https://gerrit.wikimedia.org/r/481140 [10:28:18] (03PS1) 10Muehlenhoff: Re-add blacklist for /sbin in thumbor profile [puppet] - 10https://gerrit.wikimedia.org/r/481141 [10:28:20] (03PS1) 10Muehlenhoff: Switch /etc/firejail/thumbor.profile to the mediawiki profile [puppet] - 10https://gerrit.wikimedia.org/r/481142 [10:28:22] (03PS1) 10Muehlenhoff: Remove thumbor.profile.firejail [puppet] - 10https://gerrit.wikimedia.org/r/481143 [10:31:32] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Joe) Let me state it again: the SSR service should not need to call the mediawiki api. It should accept all the information needed to rend... [10:36:42] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Joe) >>! In T212189#4838090, @Addshore wrote: > > The "termbox" is more of an application than a template. > Only it knows which data it... [10:40:33] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Joe) Also, if we're going to build microservices, I'd like to **not** see applications that "grow", at least in terms of what they can do.... [10:48:30] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Joe) >>! 
In T212189#4839848, @WMDE-leszek wrote: > The intention of introducing the service is not to have a service that call Mediawiki.... [10:53:38] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10faidon) @Cmjohnson what's the status of this? [11:02:53] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/480757 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [11:03:47] (03CR) 10Hashar: "Sorry for the spam, I have used this change to test a migration of doc.wikimedia.org to a new host ( T137890 )." [software/spicerack] - 10https://gerrit.wikimedia.org/r/480757 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [11:09:03] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10WMDE-leszek) To avoid misunderstandings: I was not questioning MediaWiki's action API being performant. By "lightweight" I was referring t... 
[11:10:53] (03CR) 10GTirloni: [C: 03+2] php72: add RSVG [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/480159 (https://phabricator.wikimedia.org/T151656) (owner: 10MaxSem) [11:10:54] (03CR) 10Gilles: [C: 03+1] Align thumbor.profile and mediawiki-converters.profile [puppet] - 10https://gerrit.wikimedia.org/r/481139 (owner: 10Muehlenhoff) [11:11:16] (03Merged) 10jenkins-bot: php72: add RSVG [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/480159 (https://phabricator.wikimedia.org/T151656) (owner: 10MaxSem) [11:11:26] jouncebot: next [11:11:26] In 407 hour(s) and 18 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190107T1030) [11:11:32] nice [11:13:08] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Joe) >>! In T212189#4840039, @WMDE-leszek wrote: > To avoid misunderstandings: I was not questioning MediaWiki's action API being performa... [11:17:19] (03CR) 10Gilles: "That looks fine, but I can't be around for any fallout that might happen, as I'm vanishing for the next 2 weeks in a few hours. 
Let's depl" [puppet] - 10https://gerrit.wikimedia.org/r/481141 (owner: 10Muehlenhoff) [11:34:11] 10Operations, 10Release Pipeline, 10Core Platform Team Backlog (Watching / External), 10Release-Engineering-Team (Watching / External), 10Services (watching): Revisit the logging work done on Q1 2017-2018 for the standard pod setup - https://phabricator.wikimedia.org/T207200 (10akosiaris) [11:37:15] (03CR) 10Muehlenhoff: "Definitely, that wasn't meant to be merged until next year :-)" [puppet] - 10https://gerrit.wikimedia.org/r/481141 (owner: 10Muehlenhoff) [11:38:59] !log rebooting debug proxies to pick up SSBD-enabled QEMU [11:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:51] 10Operations, 10netops: Add eqsin routing special cases to jnt - https://phabricator.wikimedia.org/T211930 (10faidon) 1) On received routes: I don't think we should be making these kind of community-matching in `BGP_community_actions`. Rather, I think we should have `ASnnnn_in` policy-statements, that map our... 
[11:46:13] (03CR) 10Mathew.onipe: "Reply inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: 10Mathew.onipe) [11:47:40] (03PS2) 10Arturo Borrero Gonzalez: openstack: introduce nova templates for newton [puppet] - 10https://gerrit.wikimedia.org/r/481006 (https://phabricator.wikimedia.org/T212302) [11:49:24] (03PS3) 10Arturo Borrero Gonzalez: openstack: introduce templates for newton [puppet] - 10https://gerrit.wikimedia.org/r/481006 (https://phabricator.wikimedia.org/T212302) [11:51:26] (03PS4) 10Arturo Borrero Gonzalez: openstack: introduce templates for newton [puppet] - 10https://gerrit.wikimedia.org/r/481006 (https://phabricator.wikimedia.org/T212302) [11:55:52] (03PS5) 10Arturo Borrero Gonzalez: openstack: introduce templates for newton [puppet] - 10https://gerrit.wikimedia.org/r/481006 (https://phabricator.wikimedia.org/T212302) [11:57:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "Compiler happy: https://puppet-compiler.wmflabs.org/compiler1002/14048/" [puppet] - 10https://gerrit.wikimedia.org/r/481006 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [12:02:48] (03PS1) 10Faidon Liambotis: [WIP] monitoring: add VRRP check [puppet] - 10https://gerrit.wikimedia.org/r/481154 [12:03:18] (03PS2) 10Faidon Liambotis: [WIP] monitoring: add VRRP check [puppet] - 10https://gerrit.wikimedia.org/r/481154 (https://phabricator.wikimedia.org/T150264) [12:05:44] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: compute: the libvirt service in stretch depends on other pkgs [puppet] - 10https://gerrit.wikimedia.org/r/481155 (https://phabricator.wikimedia.org/T212302) [12:06:59] 10Operations, 10monitoring, 10netops, 10Patch-For-Review: Icinga check for VRRP - https://phabricator.wikimedia.org/T150264 (10faidon) a:05faidon→03ayounsi I pushed what I had written a while ago in Gerrit (see above). 
It needs to be hooked up to our monitoring, but it should be in a working condition.... [12:07:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "Compilation seems fine: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/14049/console" [puppet] - 10https://gerrit.wikimedia.org/r/481155 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [12:08:00] 10Operations, 10Release Pipeline, 10Core Platform Team Backlog (Watching / External), 10Release-Engineering-Team (Watching / External), 10Services (watching): Revisit the logging work done on Q1 2017-2018 for the standard pod setup - https://phabricator.wikimedia.org/T207200 (10akosiaris) = rsyslog = ==... [12:10:35] !log rebooting url downloaders to pick up SSBD-enabled QEMU [12:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:50] (03PS1) 10Faidon Liambotis: [WIP] monitoring: rewrite check_jnx_alarms in Python [puppet] - 10https://gerrit.wikimedia.org/r/481157 [12:23:05] (03PS1) 10Arturo Borrero Gonzalez: openstack: introduce config files for newton [puppet] - 10https://gerrit.wikimedia.org/r/481158 (https://phabricator.wikimedia.org/T212302) [12:24:42] (03PS3) 10Reedy: Re-enable EP namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478473 (https://phabricator.wikimedia.org/T211494) [12:25:04] (03PS2) 10Arturo Borrero Gonzalez: openstack: introduce config files for newton [puppet] - 10https://gerrit.wikimedia.org/r/481158 (https://phabricator.wikimedia.org/T212302) [12:27:53] (03CR) 10Reedy: [C: 03+2] Re-enable EP namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478473 (https://phabricator.wikimedia.org/T211494) (owner: 10Reedy) [12:28:58] (03Merged) 10jenkins-bot: Re-enable EP namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478473 (https://phabricator.wikimedia.org/T211494) (owner: 10Reedy) [12:31:33] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T211494 
(duration: 00m 45s) [12:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:37] T211494: Re-enable EP Namespaces in wmf-config - https://phabricator.wikimedia.org/T211494 [12:34:09] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: T211494 (duration: 00m 44s) [12:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:13] (03CR) 10jenkins-bot: Re-enable EP namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478473 (https://phabricator.wikimedia.org/T211494) (owner: 10Reedy) [12:47:44] 10Operations, 10Release Pipeline, 10Core Platform Team Backlog (Watching / External), 10Release-Engineering-Team (Watching / External), 10Services (watching): Revisit the logging work done on Q1 2017-2018 for the standard pod setup - https://phabricator.wikimedia.org/T207200 (10fselles) As discussed over... [12:48:57] (03PS3) 10Arturo Borrero Gonzalez: openstack: introduce config files for newton [puppet] - 10https://gerrit.wikimedia.org/r/481158 (https://phabricator.wikimedia.org/T212302) [12:50:49] _joe_: re https://phabricator.wikimedia.org/T212189 we are going to have a meeting on the 3rd and discuss all of the points that have been raised so far :) [12:51:13] <_joe_> addshore: if you need me to be there, invite me [12:51:38] <_joe_> oh and also, we now have a new channel, #wikimedia-serviceops, which is less noisy than this one [12:51:42] ooooh [12:52:00] <_joe_> and should be the right place to discuss these topics with the SRE serviceops team (my team :P) [12:52:06] (03PS4) 10Arturo Borrero Gonzalez: openstack: introduce config files for newton [puppet] - 10https://gerrit.wikimedia.org/r/481158 (https://phabricator.wikimedia.org/T212302) [12:56:19] (03PS1) 10Muehlenhoff: Remove obsolete Hiera entries for labstore1001/labstore1002 [puppet] - 10https://gerrit.wikimedia.org/r/481159 (https://phabricator.wikimedia.org/T187456) [13:12:58] (03PS5) 10Arturo Borrero Gonzalez: openstack: 
introduce config files for newton [puppet] - 10https://gerrit.wikimedia.org/r/481158 (https://phabricator.wikimedia.org/T212302) [13:22:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Revert "mcrouter: temporary remove mc2033 to ease network maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/480929 (owner: 10Elukey) [13:24:08] !log installing subversion updates from stretch point release [13:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Remove obsolete Hiera entries for labstore1001/labstore1002 [puppet] - 10https://gerrit.wikimedia.org/r/481159 (https://phabricator.wikimedia.org/T187456) (owner: 10Muehlenhoff) [13:29:10] (03CR) 10Effie Mouzeli: [C: 03+2] Revert "mcrouter: temporary remove mc2033 to ease network maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/480929 (owner: 10Elukey) [13:29:20] (03PS3) 10Effie Mouzeli: Revert "mcrouter: temporary remove mc2033 to ease network maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/480929 (owner: 10Elukey) [13:29:31] 10Operations: Integrate Stretch 9.6 point update - https://phabricator.wikimedia.org/T209260 (10MoritzMuehlenhoff) These updates have been fully deployed: ` fuse libdap libxcursor subversion xapian-core ` [13:31:17] (03CR) 10DCausse: [C: 04-1] elasticsearch: allow cross cluster communication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: 10Mathew.onipe) [13:32:21] (03PS6) 10Arturo Borrero Gonzalez: openstack: introduce config files for newton [puppet] - 10https://gerrit.wikimedia.org/r/481158 (https://phabricator.wikimedia.org/T212302) [13:33:57] (03PS7) 10Arturo Borrero Gonzalez: openstack: nova: introduce config files for newton [puppet] - 10https://gerrit.wikimedia.org/r/481158 (https://phabricator.wikimedia.org/T212302) [13:48:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: nova: introduce config files for newton 
[puppet] - 10https://gerrit.wikimedia.org/r/481158 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [13:58:33] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloudvirt1030: override interface names for bridge mapping [puppet] - 10https://gerrit.wikimedia.org/r/481161 (https://phabricator.wikimedia.org/T212302) [13:59:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: cloudvirt1030: override interface names for bridge mapping [puppet] - 10https://gerrit.wikimedia.org/r/481161 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [14:05:42] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [14:06:48] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [14:08:18] (03CR) 10CDanis: [C: 03+1] "A little late to the party, but looks great -- thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/481117 (owner: 10Elukey) [14:16:44] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received [14:19:04] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [14:20:45] !log elastic@eqiad deleting unused index enwiki_general_1537906513 [14:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:09] (03CR) 10DCausse: cirrus: increase number of shards (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) (owner: 10Mathew.onipe) [14:31:41] (03PS2) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) [14:32:25] (03CR) 10Mathew.onipe: elasticsearch: allow cross cluster communication (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: 10Mathew.onipe) [14:33:01] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Milimetric) @WMDE-leszek ok, we're on the same page, except the crazy part of my proposal. I was saying **directly** routed to SSR servic... [14:51:22] (03PS1) 10ArielGlenn: make all snapshot hosts use php7.2 for dumps [puppet] - 10https://gerrit.wikimedia.org/r/481167 (https://phabricator.wikimedia.org/T211935) [14:58:12] (03PS1) 10Muehlenhoff: Support upgrades which introduce changes to binary package names (WIP) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 [15:02:08] (03PS3) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) [15:02:13] (03PS2) 10ArielGlenn: make all snapshot hosts use php7.2 for dumps [puppet] - 10https://gerrit.wikimedia.org/r/481167 (https://phabricator.wikimedia.org/T211935) [15:16:13] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10daniel) @Joe said: > the SSR service should not need to call the mediawiki api. It should accept all the information needed to render the... [15:16:56] 10Operations: Remove OOMScoreAdjust from nrpe unit file? 
- https://phabricator.wikimedia.org/T212504 (10BBlack) p:05Triage→03Low [15:20:38] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:23:20] (03PS4) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) [15:28:23] 10Operations, 10Mail, 10Toolforge, 10Patch-For-Review, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812 (10chasemp) > We could have security@tools.wmflabs.org go to the Toolforge admins +1 [15:29:17] (03CR) 10Filippo Giunchedi: [C: 03+1] Align thumbor.profile and mediawiki-converters.profile [puppet] - 10https://gerrit.wikimedia.org/r/481139 (owner: 10Muehlenhoff) [15:30:33] (03CR) 10Filippo Giunchedi: [C: 03+1] Re-add blacklist for /sbin in thumbor profile [puppet] - 10https://gerrit.wikimedia.org/r/481141 (owner: 10Muehlenhoff) [15:33:06] 10Operations: decom magnesium (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713 (10RobH) [15:33:10] 10Operations, 10Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112 (10RobH) [15:33:15] 10Operations, 10WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38 (10RobH) 05Open→03Resolved a:03RobH >>! In T38#4839437, @Aklapper wrote: > Cannot see all subtasks but I guess this task could be closed as resolved now? I can, and yeah it is old and long done!
Resolved [15:34:04] 10Operations, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10RobH) [15:35:25] 10Operations, 10WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38 (10Dzahn) Though, only the major ticket queues have been imported to Phab and not really all tickets. [15:39:54] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:40:20] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [15:47:10] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:47:36] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [15:49:31] it seems like we have a lot of 5xx flappiness lately :/ [15:51:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:52:24] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the
critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [15:53:24] indeed, a single client triggering the 5xx as an absolute number alerts afaics [15:53:46] arguably, that's not the client's fault, it's our fault. [15:54:15] we should handle inappropriate inputs and deal with them without throwing a 5xx or timing out or whatever. A 5xx/timeout indicates our code fails for those inputs. [15:57:05] agreed in general, in this case we're returning 503s though I can download the same url just fine [15:57:47] and relatedly, ditch the alert for absolute number of 5xx but keep the relative one [15:58:00] relative meaning the availability alerts [15:58:30] in this particular case that seems sensible, but in general I prefer the absolute one to the relative one. [15:58:54] if we emit any 5xx, it's something we should be debugging on our end, and the relative checks are often false-tripped by depools of sites, etc. [15:59:17] (03PS5) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) [15:59:36] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [16:00:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:00:54] 10Operations, 10WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38 (10RobH) >>!
In T38#4840473, @Dzahn wrote: > Though, only the major ticket queues have been imported to Phab and not really all tickets. I didn't think we were going to migrate the rest, should this be re-opened? [16:01:31] we do have a baseline of 5xx though that we know and accept, image resizing being the typical one [16:01:45] I think also why the relative alert gets tripped on low traffic [16:02:05] yeah but that really shouldn't be the case, it's just a quirk of how we do things today [16:04:25] getting a bit into philosophy territory here, but if we can't generate a thumb, is that client or server error? [16:04:50] PROBLEM - ensure kvm processes are running on labvirt1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm [16:05:17] (03PS4) 10Mforns: Adjust params for Analytics data_purge EventLoggingSanitization job [puppet] - 10https://gerrit.wikimedia.org/r/478129 (https://phabricator.wikimedia.org/T202429) [16:05:24] That labvirt1013 issue is me, I will silence [16:05:56] well, the sub-cases are that we can't generate a thumb because it's a bad request (e.g. beyond some reasonable limit like a 100000px thumb, or for some image format we don't support thumbing, etc) [16:05:58] ACKNOWLEDGEMENT - ensure kvm processes are running on labvirt1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm andrew bogott this is on purpose :) [16:06:03] in which case it should be a 4xx [16:06:25] and we can't generate a thumb because it's the first time that particular thumb was requested and we have to go generate it async for later requests. [16:07:18] (03PS6) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) [16:07:58] on this, I guess we could either declare that it's normal that thumbs take time to appear after being first requested and generate a 4xx while we're waiting, or we could fix it to stall on the generation? 
at least from the external POV. [16:08:02] (03PS2) 10ArielGlenn: convert snapshot/dumps python scripts in puppet to python3 [puppet] - 10https://gerrit.wikimedia.org/r/477222 (https://phabricator.wikimedia.org/T210980) [16:08:31] (but also, it's kind of not-great that we allow arbitrary thumbnailing based on whatever clients ask for anyways) [16:09:10] it's not an easy problem to solve design-wise, given where we're at today [16:09:12] IIRC since thumbor we're enforcing thumb width to not exceed original width for raster formats [16:09:38] the tricky cases IMHO are when we're hitting e.g. rsvg bugs, that's a 500 [16:09:48] but I'd favor a solution with a fixed small set of available thumb sizes that covers most reasonable cases, and having the thumber be fast enough that we can just stall on first request. [16:10:18] (and use the size attributes in the enclosing html to do arbitrary other sizes for article formatting from the most-appropriate fixed size, to handle that case) [16:11:14] (03CR) 10Mforns: "Ottomata, I think this can be merged now: the refactor allows to pass a refineMonitorClass but defaults to the previous one, so current jo" [puppet] - 10https://gerrit.wikimedia.org/r/478129 (https://phabricator.wikimedia.org/T202429) (owner: 10Mforns) [16:11:41] (we should probably sanitize/validate svg on the upload side?) 
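(editor's note) The thumbnail idea floated above (a small fixed set of sizes, with bad requests answered by a 4xx rather than a 5xx) could be sketched roughly as follows. This is only an illustration: the width list, the maximum width, the supported formats, and both function names are hypothetical and do not come from the channel or from thumbor.

```python
from bisect import bisect_left

# Hypothetical fixed set of thumbnail widths in pixels (not from the log).
ALLOWED_WIDTHS = [120, 220, 320, 640, 1280]
# Hypothetical upper bound for a "reasonable" requested width.
MAX_WIDTH = 4096
# Hypothetical set of formats the thumbnailer is assumed to support.
SUPPORTED_FORMATS = {"jpeg", "png", "gif", "svg"}


def snap_width(requested):
    """Return the smallest allowed width >= requested, else the largest bucket."""
    i = bisect_left(ALLOWED_WIDTHS, requested)
    return ALLOWED_WIDTHS[min(i, len(ALLOWED_WIDTHS) - 1)]


def thumb_status(fmt, requested):
    """Map a thumbnail request to (HTTP status, width to render or None).

    Bad input is answered with a 4xx, so that a 5xx can only mean
    a bug or an infrastructure failure on the serving side.
    """
    if fmt not in SUPPORTED_FORMATS:
        return 415, None  # Unsupported Media Type: format cannot be thumbnailed
    if requested <= 0 or requested > MAX_WIDTH:
        return 400, None  # Bad Request: width outside the accepted range
    return 200, snap_width(requested)
```

With a fixed bucket list the page HTML can still ask the browser to scale the nearest bucket to an arbitrary display size, which is the approach suggested at 16:10:18.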
[16:12:01] but yeah, it's all tricky [16:12:27] still, I don't think any of it amounts to a solid refutation of the idea that all public-facing 5xx are our code/infra's fault [16:13:23] (03PS7) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) [16:13:24] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.006 second response time [16:14:01] I think of the 5xx as the public HTTP API version of "my code crashed" when talking about local apps. Nothing should result in a crash if you don't have bugs and validate inputs. [16:14:12] That toolschecker alert is expected. andrewbogott is moving the grid master right now [16:14:53] (not that code is ever bug-free, but the point is that the 5xx always means you have a bug to track down) [16:15:05] and we almost made it without tripping the alert, the move is just about done [16:16:34] PROBLEM - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 9.274 second response time [16:17:06] (03PS8) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) [16:19:06] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [16:20:12] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [16:21:19] (03PS9) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) 
[16:21:38] RECOVERY - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.393 second response time [16:21:54] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.009 second response time [16:27:50] (03PS10) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) [16:37:12] (03PS1) 10Andrew Bogott: Move labvirt1013 to cloudvirt1013 [puppet] - 10https://gerrit.wikimedia.org/r/481185 [16:40:19] (03PS2) 10Andrew Bogott: Move labvirt1013 to cloudvirt1013 [puppet] - 10https://gerrit.wikimedia.org/r/481185 [16:43:10] (03PS11) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) [16:51:09] (03PS12) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) [16:59:43] (03CR) 10Lucas Werkmeister (WMDE): Configure WikibaseQualityConstraints on Beta (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479681 (https://phabricator.wikimedia.org/T209957) (owner: 10Lucas Werkmeister (WMDE)) [17:04:39] (03CR) 10Lucas Werkmeister (WMDE): Configure WikibaseQualityConstraints on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479681 (https://phabricator.wikimedia.org/T209957) (owner: 10Lucas Werkmeister (WMDE)) [17:05:04] (03PS1) 10Andrew Bogott: rename labvirt1013 to cloudvirt1013 [dns] - 10https://gerrit.wikimedia.org/r/481188 [17:05:22] (03CR) 10jerkins-bot: [V: 04-1] rename labvirt1013 to cloudvirt1013 [dns] - 10https://gerrit.wikimedia.org/r/481188 (owner: 10Andrew Bogott) [17:08:55] (03PS13) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 
10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) [17:09:24] (03PS3) 10Andrew Bogott: Move labvirt1013 to cloudvirt1013 [puppet] - 10https://gerrit.wikimedia.org/r/481185 (https://phabricator.wikimedia.org/T212513) [17:10:16] (03PS2) 10Andrew Bogott: rename labvirt1013 to cloudvirt1013 [dns] - 10https://gerrit.wikimedia.org/r/481188 (https://phabricator.wikimedia.org/T212513) [17:10:29] (03CR) 10jerkins-bot: [V: 04-1] rename labvirt1013 to cloudvirt1013 [dns] - 10https://gerrit.wikimedia.org/r/481188 (https://phabricator.wikimedia.org/T212513) (owner: 10Andrew Bogott) [17:11:46] (03PS3) 10Andrew Bogott: rename labvirt1013 to cloudvirt1013 [dns] - 10https://gerrit.wikimedia.org/r/481188 (https://phabricator.wikimedia.org/T212513) [17:12:07] (03PS14) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) [17:12:34] (03CR) 10jerkins-bot: [V: 04-1] rename labvirt1013 to cloudvirt1013 [dns] - 10https://gerrit.wikimedia.org/r/481188 (https://phabricator.wikimedia.org/T212513) (owner: 10Andrew Bogott) [17:13:16] (03PS4) 10Andrew Bogott: rename labvirt1013 to cloudvirt1013 [dns] - 10https://gerrit.wikimedia.org/r/481188 (https://phabricator.wikimedia.org/T212513) [17:15:12] (03CR) 10Andrew Bogott: [C: 03+2] Move labvirt1013 to cloudvirt1013 [puppet] - 10https://gerrit.wikimedia.org/r/481185 (https://phabricator.wikimedia.org/T212513) (owner: 10Andrew Bogott) [17:16:19] (03CR) 10Andrew Bogott: [C: 03+2] rename labvirt1013 to cloudvirt1013 [dns] - 10https://gerrit.wikimedia.org/r/481188 (https://phabricator.wikimedia.org/T212513) (owner: 10Andrew Bogott) [17:17:04] (03PS15) 10Mathew.onipe: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) [17:30:52] (03PS1) 10Arturo Borrero Gonzalez: openstack: virt: introduce per-component per-openstack 
per-distro classes [puppet] - 10https://gerrit.wikimedia.org/r/481194 (https://phabricator.wikimedia.org/T209948) [17:31:19] (03CR) 10jerkins-bot: [V: 04-1] openstack: virt: introduce per-component per-openstack per-distro classes [puppet] - 10https://gerrit.wikimedia.org/r/481194 (https://phabricator.wikimedia.org/T209948) (owner: 10Arturo Borrero Gonzalez) [17:42:56] (03PS2) 10Arturo Borrero Gonzalez: openstack: virt: introduce per-component per-openstack per-distro classes [puppet] - 10https://gerrit.wikimedia.org/r/481194 (https://phabricator.wikimedia.org/T209948) [17:44:30] (03CR) 10jerkins-bot: [V: 04-1] openstack: virt: introduce per-component per-openstack per-distro classes [puppet] - 10https://gerrit.wikimedia.org/r/481194 (https://phabricator.wikimedia.org/T209948) (owner: 10Arturo Borrero Gonzalez) [17:48:27] 10Operations, 10decommission, 10User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (10RobH) [17:48:57] (03PS3) 10Arturo Borrero Gonzalez: openstack: virt: introduce per-component per-openstack per-distro classes [puppet] - 10https://gerrit.wikimedia.org/r/481194 (https://phabricator.wikimedia.org/T209948) [17:49:37] (03CR) 10jerkins-bot: [V: 04-1] openstack: virt: introduce per-component per-openstack per-distro classes [puppet] - 10https://gerrit.wikimedia.org/r/481194 (https://phabricator.wikimedia.org/T209948) (owner: 10Arturo Borrero Gonzalez) [17:51:33] 10Operations, 10decommission, 10User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (10RobH) [17:52:36] 10Operations, 10decommission, 10User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (10RobH) a:05RobH→03fgiunchedi @fgiunchedi: can you confirm these are ready for reclaim and disk wipe? I claimed it, but I likely should have checked with you first! 
[17:53:01] (03PS4) 10Arturo Borrero Gonzalez: openstack: virt: introduce per-component per-openstack per-distro classes [puppet] - 10https://gerrit.wikimedia.org/r/481194 (https://phabricator.wikimedia.org/T209948) [17:53:30] (03CR) 10jerkins-bot: [V: 04-1] openstack: virt: introduce per-component per-openstack per-distro classes [puppet] - 10https://gerrit.wikimedia.org/r/481194 (https://phabricator.wikimedia.org/T209948) (owner: 10Arturo Borrero Gonzalez) [17:54:47] 10Operations, 10Cloud-Services, 10DC-Ops, 10decommission, 10Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921 (10RobH) [17:55:39] 10Operations, 10DC-Ops, 10decommission: decom californium - https://phabricator.wikimedia.org/T189921 (10RobH) [17:56:56] (03PS1) 10RobH: decom californium production dns entries [dns] - 10https://gerrit.wikimedia.org/r/481195 (https://phabricator.wikimedia.org/T189921) [17:57:20] 10Operations, 10Wikimedia-General-or-Unknown, 10Security: Massive spambot registrations at dinwiki - https://phabricator.wikimedia.org/T212519 (10MarcoAurelio) [17:57:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921 (10RobH) [17:57:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921 (10RobH) a:03Cmjohnson [17:59:01] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom californium - https://phabricator.wikimedia.org/T189921 (10RobH) [17:59:11] .win 30 [18:00:15] 10Operations, 10ops-eqiad, 10decommission: Decommission conf100[1-3] - https://phabricator.wikimedia.org/T206626 (10RobH) a:03RobH [18:10:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Update label and switch to rename labvirt1013 to cloudvirt1013 - https://phabricator.wikimedia.org/T212522 (10Andrew) [18:11:47] (03PS1) 10Andrew Bogott: cloudvirt1013: enable alerts [puppet] - 
10https://gerrit.wikimedia.org/r/481197 (https://phabricator.wikimedia.org/T212513) [18:16:22] 10Operations, 10Wikimedia-Mailing-lists: Request to create mailing list for Wikimedians of Chicago User Group - https://phabricator.wikimedia.org/T212266 (10colewhite) p:05Triage→03Normal a:03colewhite [18:20:06] 10Operations, 10Wikimedia-General-or-Unknown, 10Security: Massive spambot registrations at dinwiki - https://phabricator.wikimedia.org/T212519 (10MarcoAurelio) [18:35:25] 10Operations, 10WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38 (10Dzahn) @Robh I guess it's "rejected", i just wanted to point out it was never fully done so people are aware of that. [18:41:41] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:47] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 1.240 second response time [18:43:11] 10Operations, 10Operations-Software-Development, 10Goal: Expand Netbox usage - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205868 (10crusnov) [18:44:02] apergos: I'm wondering about suppression in dumps--until a moment ago I was planning to include page_title and author_username in a dump of ORES revision scores. Because these might contain bad content however, maybe I need to omit from dumps? [18:44:51] I think we should look at our policies for other dumps. 
[18:45:01] 10Operations, 10Wikimedia-Mailing-lists: Administrator password recovery for wmfaliens@lists.wikimedia.org - https://phabricator.wikimedia.org/T212525 (10Elena) [18:48:59] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:25] !log [scb1001:~] $ sudo systemctl restart pdfrender [18:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:15] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [18:55:05] 10Operations, 10Operations-Software-Development, 10Goal: Expand Netbox usage - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205868 (10crusnov) [19:04:46] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10crusnov) [19:05:14] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10crusnov) [19:14:25] (03CR) 10Dzahn: "i am not really clear on this change yet, it says "fix loading of extensions" but looking at the code it removes an extension and loads ot" [puppet] - 10https://gerrit.wikimedia.org/r/479909 (owner: 10Paladox) [19:18:32] (03PS11) 10Dzahn: phabricator: Fix loading of php-extensions [puppet] - 10https://gerrit.wikimedia.org/r/479909 (owner: 10Paladox) [19:18:35] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/479909 (owner: 10Paladox) [19:34:29] (03PS1) 10Hashar: contint: remove unused classes [puppet] - 10https://gerrit.wikimedia.org/r/481201 (https://phabricator.wikimedia.org/T209361) [19:36:36] (03PS12) 10Paladox: phabricator: Fix loading of php-extensions [puppet] - 10https://gerrit.wikimedia.org/r/479909 [19:51:27] (03PS2) 10Dzahn: switch graphite host for 
dev_cluster from graphite1003 to 'none' [puppet] - 10https://gerrit.wikimedia.org/r/477602 (https://phabricator.wikimedia.org/T209357) [19:51:44] (03PS3) 10Dzahn: switch graphite host for dev_cluster from graphite1003 to 'none' [puppet] - 10https://gerrit.wikimedia.org/r/477602 (https://phabricator.wikimedia.org/T209357) [19:52:05] 10Operations, 10Wikimedia-General-or-Unknown, 10Security: Massive spambot registrations at dinwiki - https://phabricator.wikimedia.org/T212519 (10Billinghurst) @Whatamidoing-WMF a local example of our recent conversation [19:52:10] (03PS13) 10Paladox: phabricator: Fix loading of php-extensions [puppet] - 10https://gerrit.wikimedia.org/r/479909 [19:52:15] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/479909 (owner: 10Paladox) [19:52:28] (03CR) 10Dzahn: [C: 03+2] "per godog, setting to bogus value instead of new server name, should be unused, adding FIXME comment though" [puppet] - 10https://gerrit.wikimedia.org/r/477602 (https://phabricator.wikimedia.org/T209357) (owner: 10Dzahn) [19:52:56] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Fix loading of php-extensions [puppet] - 10https://gerrit.wikimedia.org/r/479909 (owner: 10Paladox) [19:53:30] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Fix loading of php-extensions [puppet] - 10https://gerrit.wikimedia.org/r/479909 (owner: 10Paladox) [19:54:06] (03PS14) 10Paladox: phabricator: Fix loading of php-extensions [puppet] - 10https://gerrit.wikimedia.org/r/479909 [19:55:39] (03PS15) 10Paladox: phabricator: Fix loading of php-extensions [puppet] - 10https://gerrit.wikimedia.org/r/479909 [19:55:44] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/479909 (owner: 10Paladox) [20:01:32] (03PS2) 10Dzahn: switch graphite host for prod cassandra from graphite1003 to 'none' [puppet] - 10https://gerrit.wikimedia.org/r/477604 (https://phabricator.wikimedia.org/T209357) [20:04:00] (03PS2) 10Dzahn: contint: remove unused classes [puppet] 
- 10https://gerrit.wikimedia.org/r/481201 (https://phabricator.wikimedia.org/T209361) (owner: 10Hashar) [20:04:09] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/481201 (https://phabricator.wikimedia.org/T209361) (owner: 10Hashar) [20:05:17] (03PS16) 10Paladox: phabricator: Fix loading of php-extensions [puppet] - 10https://gerrit.wikimedia.org/r/479909 [20:06:44] (03CR) 10Dzahn: [C: 03+2] "setting to none because the current graphite1003 can't be used, it's decom'ed" [puppet] - 10https://gerrit.wikimedia.org/r/477604 (https://phabricator.wikimedia.org/T209357) (owner: 10Dzahn) [20:07:08] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/479909 (owner: 10Paladox) [20:07:33] (03PS17) 10Dzahn: phabricator: Fix loading of php-extensions [puppet] - 10https://gerrit.wikimedia.org/r/479909 (owner: 10Paladox) [20:18:54] (03PS18) 10Paladox: phabricator: Fix loading of php-extensions [puppet] - 10https://gerrit.wikimedia.org/r/479909 [20:19:11] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/479909 (owner: 10Paladox) [20:21:30] (03PS1) 10Bstorm: sonofgridengine: add submit host functions to web exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/481202 [20:22:35] 10Operations, 10Mail, 10Toolforge, 10Patch-For-Review, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812 (10bd808) >>! In T182812#4840433, @chasemp wrote: >> We could have security@tools.wmflabs.org go to the Toolforge admins > > +1 Th... 
[20:22:55] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: add submit host functions to web exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/481202 (owner: 10Bstorm) [20:24:20] 10Operations, 10Mail, 10Toolforge, 10Patch-For-Review, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812 (10Dzahn) Sounds like we can close this then as either 'resolved' or technically 'rejected' i guess. [20:29:20] (03PS19) 10Dzahn: phabricator: Load mysqlnd extension before other PHP extensions [puppet] - 10https://gerrit.wikimedia.org/r/479909 (owner: 10Paladox) [20:29:44] (03PS20) 10Dzahn: phabricator: Load mysqlnd extension before other PHP extensions [puppet] - 10https://gerrit.wikimedia.org/r/479909 (owner: 10Paladox) [20:30:56] (03CR) 10Dzahn: [C: 03+2] "does not affect current production instance, just stretch (1002)" [puppet] - 10https://gerrit.wikimedia.org/r/479909 (owner: 10Paladox) [20:35:52] 10Operations, 10Mail, 10Toolforge, 10Patch-For-Review, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812 (10bd808) 05Open→03Declined Declined per @faidon in T182812#4629616 and the valid concerns he raised about cross-domain aliasing. [20:38:00] !log phab1002 - restart php-fpm, restart phd for testing. phd fails [20:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:39] PROBLEM - Check systemd state on phab1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:41:01] (03PS1) 10Bstorm: sonofgridengine: remove more unneeded logging [puppet] - 10https://gerrit.wikimedia.org/r/481203 [20:42:21] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: remove more unneeded logging [puppet] - 10https://gerrit.wikimedia.org/r/481203 (owner: 10Bstorm) [21:04:33] !log phab1002 - removing all php related packages and letting puppet reinstall them [21:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:41] (03PS1) 10Bstorm: sonofgridengine: allow hosts to have multiple roles [puppet] - 10https://gerrit.wikimedia.org/r/481207 [21:05:11] !log phab1002 - apt autoremove [21:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:57] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: allow hosts to have multiple roles [puppet] - 10https://gerrit.wikimedia.org/r/481207 (owner: 10Bstorm) [21:17:11] awight: anything that's not visible to an anonymous user should not be dumped [21:17:48] if fields are hidden or deleted after the dump, that's how it is [21:18:22] apergos: Thanks for the concise rubric! 
[21:18:28] sure thing [21:19:29] I'll have schemas for your review soon fyi, on https://phabricator.wikimedia.org/T209732 [21:23:31] (03PS2) 10Cwhite: mediawiki: enable statsd_exporter and add matching rules to appserver [puppet] - 10https://gerrit.wikimedia.org/r/481110 (https://phabricator.wikimedia.org/T205870) [21:26:18] 10Operations, 10Wikimedia-Mailing-lists: Administrator password recovery for wmfaliens@lists.wikimedia.org - https://phabricator.wikimedia.org/T212525 (10colewhite) p:05Triage→03High a:03colewhite [21:27:51] RECOVERY - Check systemd state on phab1002 is OK: OK - running: The system is fully operational [21:28:38] !log phab1002 - mkdir -p /srv/phab/libext/ava/src ; touch __phutil_library_init__.php [21:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:28] !log phab1002 - temp hack to unbreak phd / systemd alert, real fix will be phab deployment to new server [21:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:33] ok great, I've subscribed to the ticket [21:37:52] 10Operations, 10Wikimedia-Mailing-lists: Administrator password recovery for wmfaliens@lists.wikimedia.org - https://phabricator.wikimedia.org/T212525 (10colewhite) Email sent to @Elena with reset password. [21:44:24] 10Operations, 10Wikimedia-Mailing-lists: Request to create mailing list for Wikimedians of Chicago User Group - https://phabricator.wikimedia.org/T212266 (10colewhite) The list has been created. 
At your convenience, please add the list to: https://meta.wikimedia.org/wiki/Mailing_lists/Overview [21:44:52] 10Operations, 10Wikimedia-Mailing-lists: Request to create mailing list for Wikimedians of Chicago User Group - https://phabricator.wikimedia.org/T212266 (10colewhite) 05Open→03Resolved [21:51:37] (03PS1) 10Cwhite: admin: add ldap-only user jmatazzoni [puppet] - 10https://gerrit.wikimedia.org/r/481209 (https://phabricator.wikimedia.org/T212334) [21:54:00] (03CR) 10Dzahn: [C: 03+1] admin: add ldap-only user jmatazzoni [puppet] - 10https://gerrit.wikimedia.org/r/481209 (https://phabricator.wikimedia.org/T212334) (owner: 10Cwhite) [21:55:03] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 10Patch-For-Review: Make CI run Varnish VCL tests - https://phabricator.wikimedia.org/T128188 (10hashar) 05Open→03Resolved a:03ema I am pretty sure @ema finished up the integration of varnishtest with CI / rake test. Well done :) [21:55:54] (03CR) 10Cwhite: [C: 03+2] admin: add ldap-only user jmatazzoni [puppet] - 10https://gerrit.wikimedia.org/r/481209 (https://phabricator.wikimedia.org/T212334) (owner: 10Cwhite) [22:08:49] PROBLEM - Restbase root url on restbase1017 is CRITICAL: connect to address 10.64.32.129 and port 7231: Connection refused [22:13:58] (03PS1) 10Krinkle: Disable Navigation Timing on closed/private/fishbowl wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481212 [22:14:53] RECOVERY - Restbase root url on restbase1017 is OK: HTTP OK: HTTP/1.1 200 - 16164 bytes in 0.005 second response time [22:15:28] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937 (10hashar) [22:15:32] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Backlog): Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771 
(10hashar) 05Stalled→03Resolved a:03hashar The original intent was to have two masters. But we can not commi... [22:16:30] that restbase1017 one.. that is systemd starting the service.. hmm [22:54:01] (03PS1) 10Bstorm: toolforge: add the new cloud region to all_networks [puppet] - 10https://gerrit.wikimedia.org/r/481215 [23:05:32] (03CR) 10Bstorm: "I wonder if both values for cloud stuff should actually be somewhere else." [puppet] - 10https://gerrit.wikimedia.org/r/481215 (owner: 10Bstorm)