[00:00:15] (03PS1) 10Thcipriani: Revert "all wikis to 1.30.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365194 [00:01:21] (03CR) 10Thcipriani: [C: 032] Revert "all wikis to 1.30.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365194 (owner: 10Thcipriani) [00:01:22] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [00:02:04] PROBLEM - MariaDB Slave Lag: s4 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 180029.54 seconds [00:02:39] (03Merged) 10jenkins-bot: Revert "all wikis to 1.30.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365194 (owner: 10Thcipriani) [00:02:52] (03CR) 10jenkins-bot: Revert "all wikis to 1.30.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365194 (owner: 10Thcipriani) [00:42:42] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 41.38% of data above the critical threshold [1800.0] [01:10:00] 10Operations, 10ops-codfw: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3438252 (10Papaul) Update on case Dell now uses a company called Unisys to perform all the cal services when I comes to part replacement. They do received the call and send the dispatch... [01:45:55] 10Operations, 10Commons, 10Traffic, 10media-storage: ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION - https://phabricator.wikimedia.org/T170605#3438283 (10Jeff_G) I can no longer reproduce this error, so thanks to whoever or whatever fixed it! [02:17:23] 10Operations, 10monitoring: create netmon1003, migrate servermon from netmon1001 to netmon1003 - https://phabricator.wikimedia.org/T170653#3438295 (10Dzahn) [02:19:16] (03PS1) 10Dzahn: add netmon1003, v4 and v6 [dns] - 10https://gerrit.wikimedia.org/r/365199 (https://phabricator.wikimedia.org/T170653) [02:26:52] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:27:53] ACKNOWLEDGEMENT - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn saw on IRC its already known and ema working on puppetizing missing kernel module [03:47:02] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [03:48:00] 10Operations: VM request: netmon1003 - https://phabricator.wikimedia.org/T170655#3438336 (10Dzahn) [03:49:13] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [04:04:22] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:05:02] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:08:36] !copy LE ssl cert/key for librenms from netmon1002 to netmon2001 | (T166180) [04:08:36] T166180: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180 [04:10:02] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=767.10 Read Requests/Sec=478.10 Write Requests/Sec=19.10 KBytes Read/Sec=51026.40 KBytes_Written/Sec=508.00 [04:19:12] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=12.80 Read Requests/Sec=0.70 Write Requests/Sec=12.00 KBytes Read/Sec=3.20 KBytes_Written/Sec=253.20 [04:21:35] !log netmon1002/netmon2001 - change UID/GID for rancid to universal 445/445, use find -exec to chown existing files, for unmessy data syncing, define UID on wikitech page UID (T166180) [04:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:21:47] T166180: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180 [04:37:23] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 [04:46:32] RECOVERY - 
Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [05:33:15] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3438408 (10Marostegui) And db1106 got installed finely [05:33:53] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3438409 (10Marostegui) 05Open>03Resolved [05:35:13] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 [05:50:05] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365202 [05:50:11] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365202 [05:52:08] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365202 (owner: 10Marostegui) [05:53:10] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365202 (owner: 10Marostegui) [05:53:23] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365202 (owner: 10Marostegui) [05:54:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1072 - T166204 (duration: 00m 46s) [05:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:27] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [06:05:59] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0] [06:09:21] (03Abandoned) 10Muehlenhoff: Enable base::firewall for labtestpuppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/364945 (owner: 10Muehlenhoff) [06:29:20] (03CR) 10Muehlenhoff: "IMO the better way to limit someone from sending malware 
to our lists would be to restrict attachments to e.g. plain text and a few select" [puppet] - 10https://gerrit.wikimedia.org/r/364827 (https://phabricator.wikimedia.org/T170462) (owner: 10Herron) [06:46:49] PROBLEM - nova-compute process on labvirt1011 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [06:47:49] RECOVERY - nova-compute process on labvirt1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [06:59:24] !log Create views for dinwiki on labsdb1009, 1010 and 1011 - T169193 [06:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:36] T169193: Prepare and check storage layer for dinwiki - https://phabricator.wikimedia.org/T169193 [07:00:29] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3438449 (10Marostegui) [07:04:09] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [07:04:49] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0 [07:05:34] 10Operations: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3438450 (10MoritzMuehlenhoff) I've now built nodejs 6.11 with the recent security fixes and imported it to stretch-wikimedia. 
[07:15:59] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [07:19:19] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [07:22:45] !log Stop replication on labsdb1011 for maintenance - T153743 [07:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:56] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [07:30:13] (03Abandoned) 10Marostegui: mariadb: Split eventlogging, misc, monitor classes [puppet] - 10https://gerrit.wikimedia.org/r/325764 (https://phabricator.wikimedia.org/T152081) (owner: 10Marostegui) [08:00:27] (03CR) 10Muehlenhoff: "I've followed up on T170496, let's use that for further discussion" [puppet] - 10https://gerrit.wikimedia.org/r/365096 (https://phabricator.wikimedia.org/T170496) (owner: 10Ottomata) [08:05:45] 10Operations, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3438511 (10fgiunchedi) [08:05:47] 10Operations, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: Complete stretch reimage for ms-fe / ms-be fleet - https://phabricator.wikimedia.org/T169601#3438509 (10fgiunchedi) 05Open>03Resolved This is completed! 
62x swift hosts with stretch [08:08:20] (03CR) 10Alexandros Kosiaris: "A nice question to answer would be" [puppet] - 10https://gerrit.wikimedia.org/r/364827 (https://phabricator.wikimedia.org/T170462) (owner: 10Herron) [08:09:09] (03CR) 10Alexandros Kosiaris: "moritz: FYI ^" [puppet] - 10https://gerrit.wikimedia.org/r/365095 (https://phabricator.wikimedia.org/T170496) (owner: 10Ottomata) [08:10:59] (03CR) 10Alexandros Kosiaris: "scratch that, I 've seen now the discussion on the rest of the related tasks" [puppet] - 10https://gerrit.wikimedia.org/r/365095 (https://phabricator.wikimedia.org/T170496) (owner: 10Ottomata) [08:12:56] (03CR) 10Alexandros Kosiaris: [C: 04-2] "No, please let's not do this. I 'd rather the policy was very clearly specified and not dependent on some parameter. Not to mention the fa" [puppet] - 10https://gerrit.wikimedia.org/r/365120 (owner: 10Ottomata) [08:14:38] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3438514 (10jcrespo) Thank you all people for the help. [08:35:33] 10Operations, 10monitoring, 10Graphite, 10User-fgiunchedi: Audit groups of metrics in Graphite that allocate a lot of disk space - https://phabricator.wikimedia.org/T1075#3438523 (10fgiunchedi) [08:35:35] 10Operations, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi: Delete "servers" metrics in graphite older than 60d - https://phabricator.wikimedia.org/T169972#3438521 (10fgiunchedi) 05Open>03declined We'll need to have a different strategy for 'servers' for e.g. 
long term trending [08:39:56] 10Operations: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3438542 (10MoritzMuehlenhoff) [08:42:04] (03PS4) 10Lucas Werkmeister (WMDE): Enable WikibaseQualityConstraints statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363200 (https://phabricator.wikimedia.org/T169647) [08:48:26] 10Operations, 10ops-eqiad: labsdb1001: Swap eth0 cable - https://phabricator.wikimedia.org/T137555#3438553 (10akosiaris) 05Resolved>03stalled I 'll reopen this, albeit as `stalled` as this hasn't been resolved. I had a quick look both on switch side and host side. The host is reporting an `MDI-X: Unknown`... [08:52:50] ACKNOWLEDGEMENT - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0: alexandros kosiaris T86541 [08:54:34] (03PS3) 10Alexandros Kosiaris: lvs: Remove all bgp keywords from configuration [puppet] - 10https://gerrit.wikimedia.org/r/356790 [08:55:39] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [08:55:39] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0 [08:56:40] 10Operations, 10Recommendation-API, 10Service-deployment-requests, 10Services (doing), 10User-mobrovac: New Service Request: recommendation-api - https://phabricator.wikimedia.org/T167664#3438563 (10schana) @mobrovac Is there anything outstanding on this task? It looks like some of the work was performed... [08:59:35] (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/compiler02/7059/ looks good. 
Unless some objects I 'll merge on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/356790 (owner: 10Alexandros Kosiaris) [09:11:49] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [09:11:59] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [09:14:59] (03CR) 10Filippo Giunchedi: [C: 04-1] "More restrictive sudo rules" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [09:17:54] (03PS1) 10Muehlenhoff: Add labweb* to site.pp to enable base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/365207 (https://phabricator.wikimedia.org/T167820) [09:20:16] (03PS1) 10Jcrespo: mariadb: Depool db1062 to clone it to db1072, and other hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365208 (https://phabricator.wikimedia.org/T170662) [09:23:07] (03CR) 10Marostegui: "Is it db1072 or db2072? Same for db1062." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/365208 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [09:26:50] (03PS1) 10Muehlenhoff: Add labpuppetmaster* to site.pp to enable base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/365209 (https://phabricator.wikimedia.org/T167905) [09:40:23] (03PS2) 10Jcrespo: mariadb: Depool db1062 to clone it to db2072, and other hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365208 (https://phabricator.wikimedia.org/T170662) [09:40:56] (03PS3) 10Jcrespo: mariadb: Depool db2062 to clone it to db2072, and other hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365208 (https://phabricator.wikimedia.org/T170662) [09:43:53] (03CR) 10Marostegui: [C: 031] mariadb: Depool db2062 to clone it to db2072, and other hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365208 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [09:57:47] !log uploaded nodejs_6.11.0~dfsg-1+wmf to apt.wikimedia.org (for jessie and stretch) (T170548) [09:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:59] T170548: nodejs 6.11 - https://phabricator.wikimedia.org/T170548 [09:58:16] 10Operations, 10Commons, 10Thumbor, 10Traffic, 10media-storage: ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION - https://phabricator.wikimedia.org/T170605#3438727 (10Aklapper) [10:01:11] (03PS4) 10Jcrespo: [WIP]prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) [10:10:46] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2062 to clone it to db2072, and other hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365208 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [10:11:48] (03Merged) 10jenkins-bot: mariadb: Depool db2062 to clone it to db2072, and other hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365208 
(https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [10:12:00] (03CR) 10jenkins-bot: mariadb: Depool db2062 to clone it to db2072, and other hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365208 (https://phabricator.wikimedia.org/T170662) (owner: 10Jcrespo) [10:14:08] 10Operations, 10Icinga, 10monitoring: Icinga loses downtime entries, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3438807 (10jcrespo) As a comment, the "phantom results" still happens, but it was not part of the scope of this, nor it is that much impacting. With phantom results I m... [10:17:17] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2062 (duration: 00m 47s) [10:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:33] why is dbstore1002 lagging? no alter than I can see [10:32:59] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 888.96 seconds [10:33:14] research is doing some queries [10:40:59] they are blocking replication [10:43:29] !log altering wmde_analytics_betafeature_users_today table to ENGINE=InnoDB [10:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:36] (03PS1) 10Jcrespo: analytics-store: Ban Aria/MyISAM tables from WMF infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/365228 [10:54:09] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 133.09 seconds [10:56:09] elukey: dbstore1002 is at 7% disk capacity, at 6% it will start paging - note we warned you at 10%, at 9% and at 8% and created T168303 [10:56:09] T168303: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T168303 [11:35:47] !log stop db2062 and db2072 for cloning [11:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:28] (03PS1) 10Marostegui: s6.hosts: Add labsdb1011 [software] - 10https://gerrit.wikimedia.org/r/365233 
(https://phabricator.wikimedia.org/T153743) [11:40:18] jynus: I am going to throw 50G more to dbstore1002 (and after that it will still have around 200G more be added if needed) [11:41:04] (03CR) 10Marostegui: [C: 04-2] "s6 has almost finished imported, but not 100% yet, so I will wait to merge this until it is 100% finished" [software] - 10https://gerrit.wikimedia.org/r/365233 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [11:41:16] !log Add 50G to /srv/ on dbstore1002 - T168303 [11:41:20] marostegui: but I am worried [11:41:25] Yeah, me too [11:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:27] T168303: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T168303 [11:41:29] because it seems to keep growing [11:41:38] quite fast, actually [11:41:47] jynus: I think next week we will run the purge script (he was running it on db1047 yesterday) [11:42:00] ok if that is true [11:50:20] (03CR) 10Marostegui: [C: 031] analytics-store: Ban Aria/MyISAM tables from WMF infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/365228 (owner: 10Jcrespo) [12:56:06] (03PS1) 10Filippo Giunchedi: prometheus: use blackbox-exporter package from Debian [puppet] - 10https://gerrit.wikimedia.org/r/365239 (https://phabricator.wikimedia.org/T169860) [12:56:08] (03PS1) 10Filippo Giunchedi: prometheus: additional blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/365240 (https://phabricator.wikimedia.org/T169860) [13:06:57] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017): Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3439035 (10BBlack) @Johan - Do you have any kind of estimate on Community's inputs here and time... 
[13:23:39] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [13:24:39] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [13:43:06] (03PS1) 10Filippo Giunchedi: kafkatee: send 4xx to logstash as well [puppet] - 10https://gerrit.wikimedia.org/r/365247 [13:45:19] (03CR) 10Andrew Bogott: [C: 032] "This seems right, although I don't understand how it fits with the 'one role only in site.pp' rule." [puppet] - 10https://gerrit.wikimedia.org/r/365207 (https://phabricator.wikimedia.org/T167820) (owner: 10Muehlenhoff) [13:45:33] (03CR) 10Andrew Bogott: [C: 032] Add labpuppetmaster* to site.pp to enable base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/365209 (https://phabricator.wikimedia.org/T167905) (owner: 10Muehlenhoff) [13:45:45] (03PS2) 10Andrew Bogott: Add labweb* to site.pp to enable base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/365207 (https://phabricator.wikimedia.org/T167820) (owner: 10Muehlenhoff) [13:47:54] (03PS2) 10Andrew Bogott: Add labpuppetmaster* to site.pp to enable base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/365209 (https://phabricator.wikimedia.org/T167905) (owner: 10Muehlenhoff) [13:49:24] (03PS2) 10Andrew Bogott: Puppetmaster: Fix apache config ssldir [puppet] - 10https://gerrit.wikimedia.org/r/365053 [13:56:32] (03CR) 10Ottomata: "I agree! 
2 things then:" [puppet] - 10https://gerrit.wikimedia.org/r/365120 (owner: 10Ottomata) [13:58:01] (03PS4) 10MacFan4000: Update ExtensionDistributer versions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365137 [14:44:41] (03PS1) 10Jcrespo: mariadb.service: Set start/stop timeout to 10 minutes [software] - 10https://gerrit.wikimedia.org/r/365255 [14:50:05] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=736.40 Read Requests/Sec=733.20 Write Requests/Sec=16.00 KBytes Read/Sec=19922.80 KBytes_Written/Sec=888.40 [14:51:57] (03CR) 10Faidon Liambotis: [C: 04-1] WIP: base::kernel: add base::kernel::module (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/365030 (owner: 10Ema) [14:52:05] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=231.80 Read Requests/Sec=282.80 Write Requests/Sec=24.00 KBytes Read/Sec=3254.40 KBytes_Written/Sec=1456.40 [14:58:15] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=794.90 Read Requests/Sec=287.60 Write Requests/Sec=19.00 KBytes Read/Sec=3384.00 KBytes_Written/Sec=1381.60 [15:01:15] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=633.30 Read Requests/Sec=1110.10 Write Requests/Sec=52.50 KBytes Read/Sec=19022.00 KBytes_Written/Sec=1993.60 [15:05:24] (03PS3) 10Eevans: Configure an additional data file directory [puppet] - 10https://gerrit.wikimedia.org/r/365081 [15:05:25] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=11.10 Read Requests/Sec=0.70 Write Requests/Sec=17.80 KBytes Read/Sec=6.80 KBytes_Written/Sec=338.80 [15:07:44] mailman i/o is me, going to downtime it [15:13:12] (03PS3) 10Andrew Bogott: Puppetmaster: Fix apache config ssldir [puppet] - 10https://gerrit.wikimedia.org/r/365053 [15:13:13] (03PS1) 10Andrew Bogott: Puppetmaster profile: Support switching off active records [puppet] - 10https://gerrit.wikimedia.org/r/365257 [15:17:44] 
(03PS9) 10Filippo Giunchedi: role: use alertmanager in beta prometheus [puppet] - 10https://gerrit.wikimedia.org/r/354460 [15:17:46] (03PS9) 10Filippo Giunchedi: WIP prometheus::alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/354976 [15:17:53] (03PS4) 10Andrew Bogott: Puppetmaster: Fix apache config ssldir [puppet] - 10https://gerrit.wikimedia.org/r/365053 [15:17:55] (03PS2) 10Andrew Bogott: Puppetmaster profile: Support switching off active records [puppet] - 10https://gerrit.wikimedia.org/r/365257 [15:17:57] (03PS1) 10Andrew Bogott: Include labsrootpass on labs puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/365258 [15:19:13] (03CR) 10jerkins-bot: [V: 04-1] WIP prometheus::alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/354976 (owner: 10Filippo Giunchedi) [15:19:56] (03CR) 10Andrew Bogott: [C: 032] Include labsrootpass on labs puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/365258 (owner: 10Andrew Bogott) [15:49:35] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2128143 [15:50:35] (03PS1) 10Muehlenhoff: Migrate former Salt minion to standalone tools executed via Cumin (WIP) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/365263 [15:51:18] robh: do you take care about mailserver(s) as well? [15:51:50] Danny_B: not directly, we actually have an opsen working on mail right now (dmarc investigation and the like) [15:52:05] but im on clinic duty so if you have a request involving them i can make sure they know about it! [15:52:09] the spam wave is increasing [15:52:18] the spam to lists? [15:52:32] there was an email about that this am to the list owners being spammed [15:52:32] to *-owner emails [15:52:59] yup. 
and i've sent some investigations there on which the blocks should be set [15:53:44] we just need to cut those emails before they hit any application such as mailman [15:55:19] i'm getting 2 mails / min in average recently [15:56:04] Danny_B: ok, i can confirm im chatting with him [15:56:10] he is working on some patches [15:56:48] There is some concern with rolling a change on friday that could break mailman though [15:56:50] pls let him know about the investigations i've sent [15:56:56] he is on that thread [15:56:59] ok [15:57:04] its keith herron, he is the one opsen who replied =] [15:57:13] don't patch mailman, simply block it on mailserver directly [15:57:43] it was pointed out that is one of the larger mail delivery services in china [15:57:51] im not sure blocking it entirely is the course we can take [15:57:56] yes. and that's why i've narrowed it [15:58:11] is that the subject always section? [15:58:34] i've sent the regexp for email address and subject [15:58:44] Yes, I am looking at it [15:58:56] up till now, all mails i've got matched that [15:59:46] notifying qq.com about the issue could also be helpful (assuming the sender address is not spoofed) [15:59:52] so https://phabricator.wikimedia.org/T170601 is the task, and Ive relayed to him about the blocking at the mta level before passing to mailman [16:00:10] (03PS1) 10Addshore: Remove time command from statistics::wmde crons [puppet] - 10https://gerrit.wikimedia.org/r/365264 (https://phabricator.wikimedia.org/T170282) [16:00:24] ^^ it would be awesome if someone could merge that one before the weekend ^^ [16:00:38] Fixing some fallout from moving those crons from stat1002 to stat1005 [16:01:25] addshore: if its wholly fubar its not going to break anything but those wmde scripts? 
[16:01:43] I can see it changes the command, but im not familiar with the service in any regard [16:02:08] ie: i worry about merging things when i am not 100% certain what they touch down the line (service dependencies) [16:02:11] essentially the user running the cron cant run the time command on the server so the crons are totally broken right now [16:02:31] the only change is the time command is no longer output in the logs, but that means the crons will actually run over the weekend [16:02:55] and you've run these strings without time, etc? [16:02:59] yup [16:03:03] sorry if it seems im being overly cautious! [16:03:09] im happy to merge for you then [16:03:17] no worries :) I can understand as this is the first time you have seen this random stuff ;) [16:03:24] herron: Danny_B =] [16:03:33] chat in here so i dont have to repeat stuff! [16:03:40] but i wanna follow along [16:03:51] Danny_B: can you pass along the spam scoring for your spam emails if possible? [16:03:54] robh thanks! [16:04:08] spam scoring? [16:04:15] (03CR) 10RobH: [C: 032] Remove time command from statistics::wmde crons [puppet] - 10https://gerrit.wikimedia.org/r/365264 (https://phabricator.wikimedia.org/T170282) (owner: 10Addshore) [16:04:32] thanks robh! :) [16:04:36] Danny_B yes I'm curious what the spam scores on those messages look like. I'd prefer not to play whack a mole with subject and from header if possible [16:04:41] Now I don't have to run them manually each day :D [16:04:44] Danny_B could you bounce some with full headers? [16:05:13] Hi jynus! Are you there? [16:07:04] herron: atm those regexps i've sent are 100% working while none of the actions taken so far is :-/ [16:07:37] (03PS1) 10Filippo Giunchedi: prometheus: move external_url to class parameter [puppet] - 10https://gerrit.wikimedia.org/r/365266 [16:08:20] herron: https://pastebin.com/uhCQ6N95 [16:08:26] is that what you needed? [16:08:53] Danny_B yes, thanks! 
[16:09:00] my thunderbird ignores that and does not mark it as spam. my mail provider also does not move it to the spamfolder [16:09:15] so obviously marking it does not work as desired [16:09:40] another 56 emails during the chat here :-/ [16:11:03] Danny_B ok, what I am thinking is lower the spam score that triggers a reject by the MTA. this way we're not back to square one if/when the subject changes. [16:11:52] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3366966 (10Koavf) It's live and seems like it was all imported correctly (thanks Warburg) but it's still not live on Wikidata for in... [16:13:38] 10Operations, 10Services (next), 10User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3439532 (10mobrovac) @Ottomata @bearND @Gehel @elukey @akosiaris could you please check and make sure your respective services work on node 6.11? We plan rolling it out next week. [16:13:47] (03PS1) 10Herron: Change lists to reject spam score of 6 or greater via exim acl [puppet] - 10https://gerrit.wikimedia.org/r/365267 (https://phabricator.wikimedia.org/T170601) [16:16:45] i hardly doubt the subject will change that particular part i used for filtering but do whatever you think is effective... [16:21:29] Could anyone help me see what actual db queries are run when these errors occur? https://goo.gl/nJmg2r [16:21:33] herron: 1) you're patching mailman, but it can simply come to any wikimedia address [16:22:20] As per T170591, a query tries to insert up to 113800 rows... [16:22:20] T170591: WMDE banners failing to save - Timing out on save - https://phabricator.wikimedia.org/T170591 [16:22:26] thx in advance! 
[16:22:42] herron: besides the question is, if the mails go through mailman, or if they are being sent directly to owner address [16:23:09] (as the owner address is publicly listed on lists homepages) [16:23:53] Danny_B this is an exim patch for the MTA on lists.wikimedia.org [16:26:04] 10Operations, 10ops-codfw, 10Parsoid, 10Patch-For-Review, 10Services (watching): wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#3439560 (10RobH) This is now scheduled to take place on 2017-07-17. [16:29:28] herron: ok, i've read conf.mailman.erb so i've got confused then [16:30:59] not sure if you already set it alive, in any case i'm now receiving yet more than before... [16:32:06] (03CR) 10Herron: [C: 032] Change lists to reject spam score of 6 or greater via exim acl [puppet] - 10https://gerrit.wikimedia.org/r/365267 (https://phabricator.wikimedia.org/T170601) (owner: 10Herron) [16:32:14] (03PS2) 10Herron: Change lists to reject spam score of 6 or greater via exim acl [puppet] - 10https://gerrit.wikimedia.org/r/365267 (https://phabricator.wikimedia.org/T170601) [16:34:41] Danny_B will be applied shortly, I'll ping you [16:36:43] (03PS28) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [16:36:46] !log lowered mailman/lists spam_score exim acl to 6 - T170601 [16:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:58] T170601: Massive spam to -owner mailing lists - https://phabricator.wikimedia.org/T170601 [16:37:45] Danny_B applied, how is your inbox? [16:38:03] ask me after couple mins ;-) [16:38:45] ha ok :) fwiw I am seeing lots of qq rejects in the exim log [16:38:56] it is temporary solution hopefully? [16:42:56] herron: seems ok now. but i am worried about the low score [16:43:06] mark godog akosiaris _joe_ paravoid marostegui Hi! ^ ? (My question about 20 min ago in backscroll...) Thanks in advance!!! 
:) [16:43:59] AndyRussG: a lot of them may be gone since its later in the day on a friday for them [16:44:11] but if you have a phab task about it, i can bring it up in the ops meeting as a blocker or needed help [16:44:25] (ops meeting on monday) [16:44:38] robh: T170591 (Just set as unbreak now, infact) [16:44:38] T170591: WMDE banners failing to save - Timing out on save - https://phabricator.wikimedia.org/T170591 [16:44:54] I might be able to trigger the error [16:45:08] Or better yet, reproduce it on the beta cluster [16:45:17] That'd be safer fer sure [16:45:43] i've flagged it for listing in the ops meeting notes on monday. [16:45:54] K thx! [16:46:01] i know its not the fastest answer, but better than none right =] [16:46:25] fer sure [16:46:36] AndyRussG: DBAs are off for the day, but debugging live in prod is probably not a great idea indeed [16:46:47] Danny_B I think we should give it some time. in the event of false positive the sender should receive a DSN. so far I'm seeing truckloads of qq.com rejects [16:46:50] AndyRussG: also if we flag this with #dba [16:46:52] AndyRussG: it'd be better for you to reproduce this in beta or your dev environment [16:46:58] it will show up on their radar [16:47:54] robh paravoid k thx...! 
yes jynus did comment, info he provided is what made us think it's urgent [16:49:58] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Performance, WMDE-Fundraising-CN: WMDE banners failing to save - Timing out on save - https://phabricator.wikimedia.org/T170591#3439599 (AndyRussG) [17:00:23] addshore: i forgot to ping you when stat1005 puppet run was done [17:00:28] but your patch is now live [17:00:38] has been for a bit ;] [17:02:18] (CR) Dzahn: "yea, agree, i was actually already trying this last night with [[::digit::]] etc but couldn't get it to work and then if we still end up w" [puppet] - https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: Dzahn) [17:02:54] (CR) Dzahn: [C: -1] "will try some more to fine-tune the sudo privs" [puppet] - https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: Dzahn) [17:04:28] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review, User-Urbanecm: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3439664 (Jayprakash12345) [17:05:15] PROBLEM - nutcracker process on thumbor1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:06:05] RECOVERY - nutcracker process on thumbor1003 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [17:07:39] Operations, Cloud-Services, Patch-For-Review: rack/setup/install labweb100[12].wikimedia.org - https://phabricator.wikimedia.org/T167820#3439671 (RobH) a:RobH>Andrew [17:09:35] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=575.60 Read Requests/Sec=699.00 Write Requests/Sec=112.50 KBytes Read/Sec=17123.20 KBytes_Written/Sec=1586.80 [17:10:35] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=226.50 Read Requests/Sec=123.70 Write Requests/Sec=1.30 KBytes Read/Sec=1222.00 KBytes_Written/Sec=183.20 [17:16:46] (CR) Faidon Liambotis: [C: -1] Puppetmaster profile: Support switching off active records (2 comments) [puppet] - https://gerrit.wikimedia.org/r/365257 (owner: Andrew Bogott) [17:16:48] (PS1) RobH: setting labnet100[34] install params [puppet] - https://gerrit.wikimedia.org/r/365270 [17:17:43] (CR) RobH: [C: 2] setting labnet100[34] install params [puppet] - https://gerrit.wikimedia.org/r/365270 (owner: RobH) [17:19:29] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3439699 (RobH) [17:19:45] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=682.10 Read Requests/Sec=640.90 Write Requests/Sec=2.80 KBytes Read/Sec=6044.80 KBytes_Written/Sec=712.80 [17:21:45] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=966.90 Read Requests/Sec=254.20 Write Requests/Sec=13.00 KBytes Read/Sec=2789.20 KBytes_Written/Sec=674.00 [17:21:56] more downtime needed for mailman io :) [17:22:01] herron: got spam [17:22:16] minute and two ago [17:22:20] thw qq one [17:22:24] *the [17:22:56] Danny_B how many? 2? 
[17:22:59] X-Spam-Score: 3.3 (+++) [17:23:07] yes 2 atm [17:23:13] ok [17:23:25] subject still matches as well as the sender [17:23:45] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=229.40 Read Requests/Sec=59.50 Write Requests/Sec=9.70 KBytes Read/Sec=834.40 KBytes_Written/Sec=696.80 [17:23:58] the other was X-Spam-Score: 5.6 (+++++) [17:24:21] ok, could you paste the full headers of those? [17:24:27] curious what is different [17:24:56] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Performance, WMDE-Fundraising-CN: WMDE banners failing to save - Timing out on save - https://phabricator.wikimedia.org/T170591#3439737 (jcrespo) > Is there a log of the queries somewhere? This is what I got from the log yo... [17:27:27] herron: https://pastebin.com/XBv8mYsu https://pastebin.com/bbugPQx9 [17:27:57] thanks! [17:30:45] if they took care, they could easily get to only 1.5 [17:31:08] with FREEMAIL_ENVFROM_END_DIGIT and RDNS_NONE [17:32:56] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3439740 (RobH) a:RobH>Cmjohnson The switch shows: ``` ge-7/0/9 up up labnet1003 ge-8/0/12 up up labnet1004 ``` But when attempti... [17:41:20] herron: another 2 came [17:41:48] 3.3 and 3.9 [17:42:06] (PS2) Dzahn: add netmon1003, v4 and v6 [dns] - https://gerrit.wikimedia.org/r/365199 (https://phabricator.wikimedia.org/T170653) [17:45:56] Operations, Services (next), User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3439745 (bearND) All tests pass in mobileapps. 
[17:46:08] Operations, Services (next), User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3439748 (bearND) [17:50:21] (CR) Dzahn: [C: 2] add netmon1003, v4 and v6 [dns] - https://gerrit.wikimedia.org/r/365199 (https://phabricator.wikimedia.org/T170653) (owner: Dzahn) [17:50:52] Operations, Services (next), User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3434983 (Arlolra) I ran our battery of tests for Parsoid and everything seems fine. It'd be preferable to upgrade ruthenium to v6.11.0 first, and then we can do a run of rountrip testing there before... [17:55:05] Operations, vm-requests: VM request: netmon1003 - https://phabricator.wikimedia.org/T170655#3439757 (Dzahn) [17:56:49] Operations, Services (next), User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3439762 (MoritzMuehlenhoff) @Arlolra : Sure, I can upgrade ruthenium on Monday [18:06:31] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review, User-Urbanecm: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3439812 (Xqt) [18:07:48] Operations, vm-requests: VM request: netmon1003 - https://phabricator.wikimedia.org/T170655#3439825 (Dzahn) ``` gnt-instance add \ -t drbd \ -I hail \ --net 0:link=public \ -g row_A \ --hypervisor-parameters=kvm:boot_order=network \ -o debootstrap+default \ --no-inst... 
[18:11:04] herron: another 7 in the meantime [18:21:11] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review, User-Urbanecm: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3439878 (Xqt) [18:22:00] (PS1) Dzahn: install_server: add netmon1003 to DHCP/partman [puppet] - https://gerrit.wikimedia.org/r/365276 (https://phabricator.wikimedia.org/T170655) [18:24:57] (CR) Dzahn: [C: 2] install_server: add netmon1003 to DHCP/partman [puppet] - https://gerrit.wikimedia.org/r/365276 (https://phabricator.wikimedia.org/T170655) (owner: Dzahn) [18:25:03] (PS2) Dzahn: install_server: add netmon1003 to DHCP/partman [puppet] - https://gerrit.wikimedia.org/r/365276 (https://phabricator.wikimedia.org/T170655) [18:25:39] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Performance, WMDE-Fundraising-CN: WMDE banners failing to save - Timing out on save - https://phabricator.wikimedia.org/T170591#3439891 (AndyRussG) Ah OK @jcrespo... Yeah, I see it was right there, sorry... And thanks!!! [18:28:32] herron: another 6. i don't think such frequency is acceptable [18:29:49] Danny_B ok I hear you. 
I'll put a filter rule in for the from/subject combination you sent along earlier [18:43:11] (PS1) Herron: Lists: Add exim filter for spam observed from qq.com [puppet] - https://gerrit.wikimedia.org/r/365279 (https://phabricator.wikimedia.org/T170601) [18:44:12] (PS2) Herron: Lists: Add exim filter for spam observed from qq.com [puppet] - https://gerrit.wikimedia.org/r/365279 (https://phabricator.wikimedia.org/T170601) [18:45:47] (CR) Herron: [C: 2] Lists: Add exim filter for spam observed from qq.com [puppet] - https://gerrit.wikimedia.org/r/365279 (https://phabricator.wikimedia.org/T170601) (owner: Herron) [18:46:03] (PS3) Herron: Lists: Add exim filter for spam observed from qq.com [puppet] - https://gerrit.wikimedia.org/r/365279 (https://phabricator.wikimedia.org/T170601) [18:51:04] Danny_B done, please let me know if you receive any more with the same attributes [18:51:56] herron: thx. in the meantime i've got another 10. with as low as 3.3 score [18:52:15] gtg afk for a bit now, but will keep you posted [18:52:44] herron: actually are you sure you put it live? i *just* received another one [18:53:41] anyway, !log & some infor in phab / list-owners would be cool too -> thx [18:53:44] afk [18:54:16] !log added exim from/subject filter for spam observed from qq.com - T170601 [18:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:27] T170601: Massive spam to -owner mailing lists - https://phabricator.wikimedia.org/T170601 [18:57:58] Operations, Wikimedia-Site-requests, Patch-For-Review, User-MarcoAurelio, User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444#3439956 (Urbanecm) Open>Resolved a:MarcoAurelio >>! In T168444#3435373, @Obsuser wrote: > No, they are not equivalent when... 
[18:58:23] (PS5) Andrew Bogott: Puppetmaster: Fix apache config ssldir [puppet] - https://gerrit.wikimedia.org/r/365053 [18:58:25] (PS3) Andrew Bogott: Puppetmaster profile: Support switching off active records [puppet] - https://gerrit.wikimedia.org/r/365257 [19:03:44] (CR) Marostegui: [C: 1] "+10000 (and I would even make it 20 minutes)" [software] - https://gerrit.wikimedia.org/r/365255 (owner: Jcrespo) [19:07:26] (PS3) Urbanecm: Provide HD logos for several Wikiversities [mediawiki-config] - https://gerrit.wikimedia.org/r/365086 (https://phabricator.wikimedia.org/T150618) [19:08:17] (CR) Urbanecm: "Removed logos which are different from the non-HD ones." [mediawiki-config] - https://gerrit.wikimedia.org/r/365086 (https://phabricator.wikimedia.org/T150618) (owner: Urbanecm) [19:12:50] (PS2) Jcrespo: mariadb.service: Set start/stop timeout to 10 minutes [software] - https://gerrit.wikimedia.org/r/365255 [19:12:52] (PS1) Jcrespo: mariadb: Add db2072 to the list of enwiki hosts [software] - https://gerrit.wikimedia.org/r/365283 (https://phabricator.wikimedia.org/T170662) [19:14:09] (PS1) Jcrespo: Revert "mariadb: Depool db2062 to clone it to db2072, and other hosts" [mediawiki-config] - https://gerrit.wikimedia.org/r/365284 [19:16:41] (PS1) Jcrespo: mariadb: Pool db2072 with low load as s1 main traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/365285 (https://phabricator.wikimedia.org/T170662) [19:18:43] (PS1) BBlack: librenms: bugfix for HTTPS redirect [puppet] - https://gerrit.wikimedia.org/r/365286 [19:20:00] (CR) BBlack: [C: 2] librenms: bugfix for HTTPS redirect [puppet] - https://gerrit.wikimedia.org/r/365286 (owner: BBlack) [19:20:30] (PS3) Urbanecm: Provide HD logos for several Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/365084 (https://phabricator.wikimedia.org/T150618) [19:36:03] (PS6) Strainu: Set collation for Romanian wikis to uca-ro-u-kn 
[mediawiki-config] - https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) [19:36:51] Ops, what do you think of merging https://gerrit.wikimedia.org/r/#/c/360891/ https://gerrit.wikimedia.org/r/#/c/360887/ [19:37:02] mutante: if you have a little time [19:40:36] (PS1) Urbanecm: Allow uploads to autoconfirmed-only at huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/365288 (https://phabricator.wikimedia.org/T169438) [19:42:48] (PS1) Halfak: Adds aspell-el to ORES base.pp [puppet] - https://gerrit.wikimedia.org/r/365289 [19:49:30] (CR) Urbanecm: "Syntax of UpdateCollation.php (it is needed to be run multiple times)" [mediawiki-config] - https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) (owner: Strainu) [19:50:10] Puppet, ORES, Scoring-platform-team: Add greek dict to ores puppet base - https://phabricator.wikimedia.org/T170709#3440098 (Halfak) [19:50:15] Puppet, ORES, Scoring-platform-team: Add greek dict to ores puppet base - https://phabricator.wikimedia.org/T170709#3440111 (Halfak) https://gerrit.wikimedia.org/r/#/c/365289 [19:50:34] (PS2) Halfak: Adds aspell-el to ORES base.pp [puppet] - https://gerrit.wikimedia.org/r/365289 (https://phabricator.wikimedia.org/T170709) [19:50:48] !log wikitech-static: re-enabled HSTS - line was commented out in Apache config, activated it again [19:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:30] Puppet, ORES, Scoring-platform-team: Add greek dict to ores puppet base - https://phabricator.wikimedia.org/T170709#3440115 (Halfak) [20:02:40] Operations, Traffic, Wikimedia-Shop, HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3440144 (BBlack) It seems like Shopify has been making some improvements on this front since we last checked. I google'd around a bit to see what I could see about Shopify's c... 
[20:05:34] Operations, Traffic, Wikimedia-Shop, HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3440147 (BBlack) [20:11:25] Operations, Domains, Traffic, User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3440154 (BBlack) To be clear then: this ticket is about our (WMF's) hosting of `wikipedia.cz` not having a valid SSL cert, and maybe touch... [20:11:44] Operations, Domains, Traffic, User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3440159 (BBlack) [20:11:48] Operations, Traffic, HTTPS, Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#3440157 (BBlack) [20:13:08] (CR) Herron: "That's a good question! Kicked off a clamscan this morning to get an idea. So far it looks like this:" [puppet] - https://gerrit.wikimedia.org/r/364827 (https://phabricator.wikimedia.org/T170462) (owner: Herron) [20:18:22] herron: i'm back. it did not work - in fact, i've got more spams now. i've checked subject: and from: on them, and they match the rules [20:18:34] rats [20:18:50] checking the scores now [20:19:14] Operations, Traffic, Wikimedia-Shop, HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3440197 (BBlack) Digging a little deeper, Shopify open-sources a lot of their infrastructure code. It seems likely that they already support the appropriate attributes at leas... 
[20:19:15] maybe it's the encoding [20:19:23] yeah [20:19:44] out of 61 new spams, 5 randomly picked are below 6 [20:20:13] ok, yeah and I can still see lots above 6 being rejected [20:20:28] so unless you returned the 12 back, we obviously need to find some more productive way than lowering down the score [20:21:00] yes agreed, although lowering it has made an improvement [20:26:04] definitely [20:26:43] let me try to dig out that string in encoded form if it will help [20:27:51] great thanks [20:30:34] herron: /.*=E5=8F=AA=E8=A6=81=E6=8A=95=E7=B4=B8=E8=8D=AD=E9=89=8B=E5=A4=A9=E5=A4=A9.*/ [20:47:07] !log mailbox lag: restarting cp1074 backend [20:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:46] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [20:53:04] (PS1) Dzahn: install_server: netmon1003 use jessie, not stretch [puppet] - https://gerrit.wikimedia.org/r/365376 (https://phabricator.wikimedia.org/T170655) [20:54:44] (PS2) Dzahn: install_server: netmon1003 use jessie, not stretch [puppet] - https://gerrit.wikimedia.org/r/365376 (https://phabricator.wikimedia.org/T170655) [20:55:43] (CR) AnotherLadsgroup: "It works ladsgroup@deployment-tin:~$ ./apache-fast-test /home/ladsgroup/mytest.url" [puppet] - https://gerrit.wikimedia.org/r/360891 (https://phabricator.wikimedia.org/T163922) (owner: Ladsgroup) [20:59:12] (CR) Dzahn: [C: 2] install_server: netmon1003 use jessie, not stretch [puppet] - https://gerrit.wikimedia.org/r/365376 (https://phabricator.wikimedia.org/T170655) (owner: Dzahn) [21:09:52] Dereckson: what's the local name of "din" please? [21:10:02] is both local and English "Dinka" ? 
[21:10:49] oh,no, "(natively Thuɔŋjäŋ, Thuɔŋ ee Jieng or simply Jieng) " but which one [21:13:27] Thuɔŋjäŋ [21:14:54] ok, thanks Danny, i used that [21:16:06] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review, User-Urbanecm: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3440400 (Dzahn) [21:17:47] herron: can i do any further research for you? [21:22:03] Danny_B have more been delivered in the past 10 minutes or so? [21:24:00] can't believe I'm saying this but it would be helpful to subscribe to the spam :) [21:24:31] when manually testing the messages were caught by the filter so wondering what's missing [21:24:33] (PS1) EBernhardson: Configure CirrusSearch-MoreLike pool counter [mediawiki-config] - https://gerrit.wikimedia.org/r/365406 (https://phabricator.wikimedia.org/T170648) [21:24:51] (CR) Dzahn: "re: the mailman I/O icinga alerts, they always fire off when there is a spike in reads, i was thinking of turning that whole thing into "w" [puppet] - https://gerrit.wikimedia.org/r/364827 (https://phabricator.wikimedia.org/T170462) (owner: Herron) [21:26:06] (CR) EBernhardson: "I'm not sure what the right numbers for the more like pool counter should be. 50 is probably incredibly conservative, we can play around w" [mediawiki-config] - https://gerrit.wikimedia.org/r/365406 (https://phabricator.wikimedia.org/T170648) (owner: EBernhardson) [21:29:02] herron: yes, couple [21:29:29] 3 in last minute [21:30:17] herron: did you change that string? i haven't notice any notification here... [21:31:17] I was testing a change of check_rfc2047_length setting [21:32:15] if I write a test message with netcat that meets those conditions it is filtered. could you bounce a few full examples to me? 
would like to do some more testing [21:32:19] and need to sign off soon [21:38:26] Operations, Wikimedia-Site-requests, Patch-For-Review, User-MarcoAurelio, User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444#3440460 (Obsuser) OK, yes, now it's all good with sr.wikiquote logos; this task is resolved. Note: I created T170722 so that en.wikiq... [21:38:28] herron: before you go, please definitely change that string [21:38:36] to that ascii encoded one [21:38:42] ok, let's try it [21:47:07] (PS1) Urbanecm: Update enwikiquote's logo [mediawiki-config] - https://gerrit.wikimedia.org/r/365409 (https://phabricator.wikimedia.org/T170722) [21:48:22] !log netmon1003 - reinstalled with jessie - saw nothing on ganeti console at all which was a bit confusing, but install finished anyways - adding to puppet / signing cert (T170655) [21:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:34] T170655: VM request: netmon1003 - https://phabricator.wikimedia.org/T170655 [21:49:05] Operations, monitoring, Patch-For-Review: create netmon1003, migrate servermon from netmon1001 to netmon1003 - https://phabricator.wikimedia.org/T170653#3440494 (Dzahn) [21:49:07] Operations, vm-requests, Patch-For-Review: VM request: netmon1003 - https://phabricator.wikimedia.org/T170655#3440493 (Dzahn) Open>Resolved [22:02:26] Operations, Discovery, Discovery-Analysis, Maps, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3440532 (debt) Moving to the tracking column on the backlog for Analysis. 
[22:02:45] (PS1) Dzahn: site: add netmon1003 with servermon role [puppet] - https://gerrit.wikimedia.org/r/365410 (https://phabricator.wikimedia.org/T170653) [22:03:01] (CR) jerkins-bot: [V: -1] site: add netmon1003 with servermon role [puppet] - https://gerrit.wikimedia.org/r/365410 (https://phabricator.wikimedia.org/T170653) (owner: Dzahn) [22:05:41] (PS2) Dzahn: site: add netmon1003 with servermon role [puppet] - https://gerrit.wikimedia.org/r/365410 (https://phabricator.wikimedia.org/T170653) [22:05:58] (CR) jerkins-bot: [V: -1] site: add netmon1003 with servermon role [puppet] - https://gerrit.wikimedia.org/r/365410 (https://phabricator.wikimedia.org/T170653) (owner: Dzahn) [22:06:18] (PS3) Dzahn: site: add netmon1003 with servermon role [puppet] - https://gerrit.wikimedia.org/r/365410 (https://phabricator.wikimedia.org/T170653) [22:08:24] (CR) Dzahn: [C: 2] site: add netmon1003 with servermon role [puppet] - https://gerrit.wikimedia.org/r/365410 (https://phabricator.wikimedia.org/T170653) (owner: Dzahn) [22:14:45] PROBLEM - dhclient process on netmon1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:15:45] PROBLEM - puppet last run on netmon1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:16:36] PROBLEM - salt-minion processes on netmon1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:16:44] ACKNOWLEDGEMENT - Check size of conntrack table on netmon1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn fresh install [22:16:44] ACKNOWLEDGEMENT - dhclient process on netmon1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn fresh install [22:16:44] ACKNOWLEDGEMENT - puppet last run on netmon1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn fresh install [22:16:44] ACKNOWLEDGEMENT - salt-minion processes on netmon1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
daniel_zahn fresh install [22:18:25] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:19:15] PROBLEM - Check the NTP synchronisation status of timesyncd on netmon1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:20:06] PROBLEM - Check whether ferm is active by checking the default input chain on netmon1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:21:05] PROBLEM - DPKG on netmon1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:21:35] Notice: /Stage[main]/Prometheus::Snmp_exporter/Base::Service_unit[prometheus-snmp-exporter]/Service[prometheus-snmp-exporter]: Triggered 'refresh' from 1 events [22:21:43] and then that's the end of it for now.. [22:21:55] PROBLEM - Disk space on netmon1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:23:16] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational [22:23:35] RECOVERY - salt-minion processes on netmon1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:23:45] RECOVERY - Disk space on netmon1003 is OK: DISK OK [22:23:45] RECOVERY - dhclient process on netmon1003 is OK: PROCS OK: 0 processes with command name dhclient [22:23:55] RECOVERY - DPKG on netmon1003 is OK: All packages OK [22:24:05] RECOVERY - Check whether ferm is active by checking the default input chain on netmon1003 is OK: OK ferm input default policy is set [22:31:42] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): collect usual GC metrics for Blazegraph JVMs - https://phabricator.wikimedia.org/T159248#3440586 (Smalyshev) [22:31:45] Operations, Discovery, Traffic, Wikidata, and 2 others: LDF endpoint ordering is not stable between servers when paging - https://phabricator.wikimedia.org/T159574#3440585 (Smalyshev) [22:31:55] RECOVERY - puppet last run on netmon1003 is OK: OK: Puppet is 
currently enabled, last run 51 seconds ago with 0 failures [22:32:11] Operations, Discovery, Wikidata, Wikidata-Query-Service, Patch-For-Review: WDQS monitoring of response times needs to be adapted now that we use LVS - https://phabricator.wikimedia.org/T148015#3440595 (Smalyshev) [22:32:14] Operations, Discovery, Wikidata, Wikidata-Query-Service, Patch-For-Review: some icinga checks on WDQS do not send notifications - https://phabricator.wikimedia.org/T144948#3440599 (Smalyshev) [22:32:16] Operations, Discovery, Salt: Failed to deploy WDQS - https://phabricator.wikimedia.org/T132952#3440606 (Smalyshev) [22:32:31] Operations, Discovery, Wikidata, Wikidata-Query-Service, Patch-For-Review: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#3440607 (Smalyshev) [22:32:52] Operations, Discovery, Traffic, Wikidata, Wikidata-Query-Service: Empty result on a tree query - https://phabricator.wikimedia.org/T127014#3440617 (Smalyshev) [22:33:16] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: Create response time monitoring for WDQS endpoint - https://phabricator.wikimedia.org/T119915#3440623 (Smalyshev) [22:33:30] Operations, Ops-Access-Requests, Patch-For-Review: Need sudo to blazegraph on wdqs1001/1002 - https://phabricator.wikimedia.org/T107819#3440640 (Smalyshev) [22:33:51] Operations, Ops-Access-Requests, Discovery, Wikidata, and 2 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#3440644 (Smalyshev) [22:49:06] RECOVERY - Check the NTP synchronisation status of timesyncd on netmon1003 is OK: OK: synced at Fri 2017-07-14 22:49:02 UTC. 
[22:51:41] (PS11) Paladox: servermon: Add gunicorn.service systemd script [puppet] - https://gerrit.wikimedia.org/r/362455 [22:56:39] (PS10) Thcipriani: contint: New role for Docker based CI slave [puppet] - https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) (owner: Dduvall) [22:56:41] (PS1) Thcipriani: CI/integration: Create profile for docker setup [puppet] - https://gerrit.wikimedia.org/r/365416 [22:57:52] (CR) jerkins-bot: [V: -1] CI/integration: Create profile for docker setup [puppet] - https://gerrit.wikimedia.org/r/365416 (owner: Thcipriani) [22:59:23] (CR) Mobrovac: [C: 1] Configure an additional data file directory [puppet] - https://gerrit.wikimedia.org/r/365081 (owner: Eevans) [23:00:03] (PS2) Thcipriani: CI/integration: Create profile for docker setup [puppet] - https://gerrit.wikimedia.org/r/365416 [23:22:38] Operations, DBA, Performance-Team, Availability (Multiple-active-datacenters): Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#3440777 (aaron) Reading things like https://www.percona.com/blog/2013/10/10/mysql-ssl-performance-overhead/ I think thi...