[02:27:43] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.5) (duration: 08m 03s)
[02:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:54:43] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.6) (duration: 08m 04s)
[02:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:58:22] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/2: down - Core: cr2-knams:xe-1/1/0 (GTT, 00341724) {#3466} [10Gbps MPLS]
[03:01:35] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jun 26 03:01:35 UTC 2017 (duration 6m 52s)
[03:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:04:22] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[04:09:22] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=6103.30 Read Requests/Sec=4115.80 Write Requests/Sec=0.90 KBytes Read/Sec=41379.60 KBytes_Written/Sec=24.40
[04:15:22] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.50 Read Requests/Sec=15.90 Write Requests/Sec=5.10 KBytes Read/Sec=431.20 KBytes_Written/Sec=256.40
[05:32:32] Operations, DBA, Wikimedia-Site-requests: Global rename of Green Cardamom → GreenC: supervision needed - https://phabricator.wikimedia.org/T168776#3377466 (Marostegui) p:Triage>Normal Looks like most of the edits are on enwiki as per: https://meta.wikimedia.org/wiki/Special:CentralAuth/Green_...
[06:21:49] (PS1) Marostegui: Revert "db-eqiad.php: Add coment to db1041 running alter" [mediawiki-config] - https://gerrit.wikimedia.org/r/361398
[06:23:30] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Add coment to db1041 running alter" [mediawiki-config] - https://gerrit.wikimedia.org/r/361398 (owner: Marostegui)
[06:24:50] (Merged) jenkins-bot: Revert "db-eqiad.php: Add coment to db1041 running alter" [mediawiki-config] - https://gerrit.wikimedia.org/r/361398 (owner: Marostegui)
[06:26:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove comments from db1041 long running alter status - T166208 (duration: 00m 47s)
[06:26:11] (CR) jenkins-bot: Revert "db-eqiad.php: Add coment to db1041 running alter" [mediawiki-config] - https://gerrit.wikimedia.org/r/361398 (owner: Marostegui)
[06:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:14] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208
[06:30:32] (PS1) Marostegui: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/361399 (https://phabricator.wikimedia.org/T166208)
[06:32:58] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/361399 (https://phabricator.wikimedia.org/T166208) (owner: Marostegui)
[06:34:08] (Merged) jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/361399 (https://phabricator.wikimedia.org/T166208) (owner: Marostegui)
[06:35:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1086 - T166208 (duration: 00m 46s)
[06:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:21] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208
[06:36:02] (CR) jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/361399 (https://phabricator.wikimedia.org/T166208) (owner: Marostegui)
[06:36:54] !log Deploy alter table s7 - db1086 - T166208
[06:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:05] Operations, DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#3377567 (Marostegui)
[06:44:34] !log Drop table wikilove_image_log from s6 - T127219
[06:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:43] T127219: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219
[06:45:43] !log Drop table wikilove_image_log from s4 - T127219
[06:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:18] !log Drop table wikilove_image_log from s2 - T127219
[06:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:47] Operations, DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#3377584 (Marostegui)
[06:49:34] !log Drop table wikilove_image_log from s7 - T127219
[06:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:44] T127219: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219
[06:49:50] Operations, DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#3377586 (Marostegui)
[06:50:47] !log execute sudo -u _graphite find /var/lib/carbon/whisper/eventstreams/rdkafka -type f -mtime +15 -delete on graphite1001 to free some space for /var/lib/carbon
[06:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:34] !log Drop table wikilove_image_log from s3 - T127219
[06:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:47] Operations, DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#3377589 (Marostegui)
[06:55:14] !log Drop table wikilove_image_log from s1 - T127219
[06:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:25] T127219: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219
[06:55:53] Operations, DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#3377593 (Marostegui)
[06:56:12] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[06:56:53] RECOVERY - puppet last run on labtestnet2001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:56:58] !log truncated neutron-server.log files in /var/log on labtestnet2001 to free some space in root
[06:57:05] andrewbogott,chasemp ---^
[06:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:30] puppet wasn't running and disk space filled three days ago :(
[06:57:49] !log Drop table wikilove_image_log from silver - T127219
[06:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:19] Operations, DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#3377605 (Marostegui)
[06:58:20] marostegui: you shouldn't drop wikilove
[06:58:31] how rude
[06:58:36] Operations, DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#2036092 (Marostegui) Open>Resolved This is all done
[06:58:36] :D good morning
[06:58:51] elukey: don't tell ema!
[07:08:15] !log powercycle elastic1017 (stuck in console, no ssh access)
[07:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:12] didn't find anything in getsel, not sure what happened to the host
[07:09:32] RECOVERY - Host elastic1017 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[07:12:27] gehel: ---^ hope that it wasn't down on purpose, I didn't find any downtime or phab task opened
[07:13:45] elukey: arriving at the "office" right now... Checking
[07:14:42] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.004 second response time
[07:15:43] !log restart pdfrender on scb1002 for the xpra issue
[07:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:03] elukey: you just restarted it? it looks fine now...
[07:22:13] gehel: yep yep just left you a msg as note
[07:22:30] I just powercycled it
[07:22:39] (PS1) Gilles: Deploy Thumbor to Commons [puppet] - https://gerrit.wikimedia.org/r/361405 (https://phabricator.wikimedia.org/T167795)
[07:24:34] elukey: Thanks!
[07:32:30] elukey: it looks like it was getting a bit hot...
[07:39:46] Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3377643 (Volans)
[07:40:35] Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3370107 (Volans) Related to T166965, please do not close until the parent is fixed as well because it will be re-opened again if Icinga loose the downtimes/acknowledges or the alarm flaps for any reason.
[07:44:22] Operations, ops-eqiad, DC-Ops: some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816#3377656 (Gehel)
[07:55:25] !log Stop replication on db1069:3313 (s3) and db1044 in the same position - T166546
[07:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:35] T166546: Point labsdb1001 and labsdb1003 to db1095 - https://phabricator.wikimedia.org/T166546
[08:11:00] !log reboot mw125[4,5,6,7] for kernel updates (appservers)
[08:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:51] !log reboot mw1258, 126[6,7,8] for kernel updates (appservers)
[08:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:49] (CR) Filippo Giunchedi: [C: 2] Deploy Thumbor to Commons [puppet] - https://gerrit.wikimedia.org/r/361405 (https://phabricator.wikimedia.org/T167795) (owner: Gilles)
[08:37:28] !log roll-restart swift-proxy to use thumbor for commons
[08:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:39] Operations, Icinga, monitoring, Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3377852 (akosiaris) >>! In T164206#3370540, @jcrespo wrote: > The only thing I could think about is to enable the debug log, what do you thi...
[08:42:17] Operations, Patch-For-Review: nutcracker test config in puppet doesn't work - https://phabricator.wikimedia.org/T168705#3377857 (ArielGlenn) p:Triage>Normal
[08:48:11] !log reboot mw1269 -> mw1272 for kernel updates (appservers)
[08:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:36] Operations, Icinga, monitoring, Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3377894 (Volans) `512` sounds reasonable and seems the only pertinent one in that list, +1!
[08:50:32] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table rel13testwiki.searchindex doesnt exist on query. Default database: rel13testwiki. [Query snipped]
[08:50:40] ^I will fix that
[08:50:48] !log starting restart of elasticsearch codfw for kernel upgrade
[08:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:32] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:58:38] Operations, Icinga, monitoring, Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3377908 (akosiaris) OK, I 've disabled puppet on tegmen and set `debug_level` to 512. Let's wait a bit and see how this goes and then puppet...
[08:58:59] !log reboot mw127[3,4,5] for kernel updates (appservers)
[08:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:42] !log reboot mw127[6,7,8,9] for kernel updates (api-appservers)
[09:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:44] Operations, Wikimedia-Logstash: Log lines on flourine overflow at 8092 bytes. - https://phabricator.wikimedia.org/T114849#3377920 (ArielGlenn) p:Triage>Normal According to http://www.rsyslog.com/doc/v8-stable/rainerscript/global.html the default maxMessageSize for rsyslog is 8k, which could be ch...
[09:07:31] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3377924 (ArielGlenn) p:Triage>Normal
[09:11:05] Operations, ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3377947 (ArielGlenn) p:Triage>Normal
[09:13:05] Operations, cloud-services-team: Reboots of cloud servers - https://phabricator.wikimedia.org/T168445#3377951 (ArielGlenn) p:Triage>Normal
[09:15:35] Operations, Labs, Labs-Infrastructure, Scoring-platform-team-Backlog: Keep wmflabs scoring boxes up-to-date - https://phabricator.wikimedia.org/T168478#3377954 (ArielGlenn) p:Triage>Normal
[09:18:33] Operations, Analytics-Kanban, Traffic: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3377967 (ArielGlenn) p:Triage>Normal
[09:20:17] (PS4) Giuseppe Lavagetto: Add build script plus nodejs base images [docker-images/production-images] - https://gerrit.wikimedia.org/r/360813
[09:34:03] !log reboot mw128[0,1,2,3] for kernel updates (api-appservers)
[09:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:34] (PS1) Gilles: Turn off chunked transfer encoding on thumbor’s nginx [puppet] - https://gerrit.wikimedia.org/r/361428 (https://phabricator.wikimedia.org/T167795)
[09:57:38] (PS2) Gilles: Turn off chunked transfer encoding on thumbor’s nginx [puppet] - https://gerrit.wikimedia.org/r/361428 (https://phabricator.wikimedia.org/T167795)
[10:00:21] (CR) Filippo Giunchedi: [C: 2] Turn off chunked transfer encoding on thumbor’s nginx [puppet] - https://gerrit.wikimedia.org/r/361428 (https://phabricator.wikimedia.org/T167795) (owner: Gilles)
[10:03:19] !log roll-restart nginx on thumbor to disable te: chunked
[10:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:42] !log reboot mw128[4,5,6,7] for kernel updates (api-appservers)
[10:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:56] (PS5) Giuseppe Lavagetto: Add build script plus nodejs base images [docker-images/production-images] - https://gerrit.wikimedia.org/r/360813
[10:28:53] !log reboot mw1288->90 for kernel updates (last batch of api-appservers)
[10:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:08] (PS1) ArielGlenn: for cirrusdumps, do not create new logs after old ones are rotated [puppet] - https://gerrit.wikimedia.org/r/361433 (https://phabricator.wikimedia.org/T162688)
[11:05:10] !log roll-restart pybal in codfw to pick up thumbor.svc.codfw.wmnet
[11:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:49] Operations, Performance-Team, Thumbor, Patch-For-Review, User-fgiunchedi: Deploy thumbor in codfw - https://phabricator.wikimedia.org/T167801#3378412 (fgiunchedi) Open>Resolved Thumbor is live in codfw too at `thumbor.svc.codfw.wmnet` using two ex-imagescalers and two ex-appservers in...
[11:20:52] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0]
[11:44:52] Operations, Graphite: Something puts many different metrics into graphite, allocating a lot of disk space - https://phabricator.wikimedia.org/T1075#18643 (ArielGlenn) Seen today: icinga-wm: PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical thr...
[11:53:19] (PS5) Ema: varnish: limit varnishd transient storage size [puppet] - https://gerrit.wikimedia.org/r/353274 (https://phabricator.wikimedia.org/T164768)
[11:57:42] (CR) Ema: [V: 2 C: 2] varnish: limit varnishd transient storage size [puppet] - https://gerrit.wikimedia.org/r/353274 (https://phabricator.wikimedia.org/T164768) (owner: Ema)
[12:02:46] !log Deploy alter table on s2 codfw master (db2017) and let it replicate - T168661
[12:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:56] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[12:14:52] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0]
[12:15:52] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[12:19:21] jouncebot: next
[12:19:21] In 0 hour(s) and 40 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170626T1300)
[12:20:19] hashar: hm, the only patch for today is for master branch https://gerrit.wikimedia.org/r/#/c/357238/
[12:20:30] shouldn't it be for a release branch?
[12:21:03] zeljkof: gehel said he will handle it which I guess is just about cherry picking it :d
[12:21:17] hashar: just notice that
[12:21:26] wmf.5 hasn't been pushed to group2 yet
[12:21:42] hashar, zeljkof: yep, I should be able to do that on my own, but I'd appreciate if one of you is around to look over my shoulder...
[12:22:20] gehel: I will be around
[12:22:26] thanks!
[12:22:31] and hashar is usually also around at that time
[12:25:18] I have cherry picked for both branches
[12:25:25] https://gerrit.wikimedia.org/r/#/c/361441/ and https://gerrit.wikimedia.org/r/#/c/361442/
[12:25:32] hashar: thanks! I just saw the notifications!
[12:25:41] Operations, Icinga, monitoring, Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3378629 (akosiaris) Up to now the icinga.debug file is 139k so I think it's safe to enable on the active icinga server as well
[12:25:52] and I guess I am going to just CR+2 them so they get in the branches by the time swat starts :D
[12:26:38] hashar: so thanks again!
[12:27:48] hashar: how does it work to update both .5 and .6 versions? Just pull to both directories and do "scap sync-dir" from the mediawiki-staging directory?
[12:28:02] * gehel will be doing his first mediawiki extension deployment
[12:32:01] +2 ed
[12:33:04] (PS15) Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850)
[12:33:42] (CR) Elukey: "@Mforns: added mediawiki whitelist and also the possibility to connect to mysql via local unix socket (set in my.cnf)" [puppet] - https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: Elukey)
[12:51:45] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3376111 (Urbanecm) For simply reopening the wiki removing it from closed.dblist should be enough. After this change would be deployed, the wiki would be editable as before...
[12:52:52] (PS1) Alexandros Kosiaris: Renumber hassium.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/361443
[12:55:33] !log reboot mw129[5,6,7,8] for kernel update (mw imagescalers, two at the time)
[12:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:40] (PS2) Urbanecm: Initial configuration for maiwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/361297 (https://phabricator.wikimedia.org/T168782)
[12:57:33] (CR) jerkins-bot: [V: -1] Initial configuration for maiwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/361297 (https://phabricator.wikimedia.org/T168782) (owner: Urbanecm)
[12:58:11] !log Deploy alter table on db2062 and db2055 - T168661
[12:58:12] PROBLEM - Host hassium is DOWN: PING CRITICAL - Packet loss = 100%
[12:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:20] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[12:58:58] marostegui: any chance that we could run today/tomorrow some alter tables to db1047 ?
[12:59:04] or even dbstore1002
[12:59:08] (log database)
[12:59:28] elukey: sure, maybe later today (can't asure it), but if not, tomorrow morning ping me when you get online
[12:59:34] super
[12:59:39] tomorrow morning is fine!
[12:59:53] You always wanted to join the morning alter party eh
[13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170626T1300). Please do the needful.
[13:00:04] gehel and debt: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:14] (CR) Alexandros Kosiaris: [C: 2] Renumber hassium.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/361443 (owner: Alexandros Kosiaris)
[13:00:18] jouncebot: o/
[13:00:30] o/
[13:01:56] jouncebot: o/
[13:03:13] Operations, Recommendation-API, Service-deployment-requests, Services (doing), User-mobrovac: New Service Request: recommendation-api - https://phabricator.wikimedia.org/T167664#3378753 (schana)
[13:03:22] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=75%)
[13:04:38] again ?
[13:04:47] probably logs spamming
[13:04:49] checking
[13:06:45] yes a lot of ERROR neutron.service OperationalError: (sqlite3.OperationalError) no such table: in neutron-server.log
[13:06:51] andrewbogott: ---^
[13:08:14] !log truncate /var/log/upstart/neutron-server.log (root filled up, spam in logs for 'ERROR neutron.service OperationalError: (sqlite3.OperationalError) no such table:')
[13:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:22] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[13:08:28] on which host Luca?
[13:08:32] * elukey amends the sal..
[13:10:33] lol
[13:10:57] elukey starts to talk to himself, not a good sign usually ;)
[13:11:10] RECOVERY - Host hassium is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[13:11:16] it could be worse
[13:11:17] https://phabricator.wikimedia.org/T168661#3378729
[13:11:28] hashar: just to check, to push that change just on mwdebug, I run "scap pull" on mwdebug1001? Correct?
[13:11:34] ahahahahahaha
[13:12:16] so 2 ops people talking to themselves in the same day
[13:12:18] (CR) Hashar: "Note that puppet is broken on deployment-imagescaler01 . Filled as T166013" [puppet] - https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: MarkTraceur)
[13:12:39] gehel: you have to first pull it on the deployment server
[13:12:45] if we weren't geographically distributed there would be some interesting conclusions to make from this
[13:12:47] talking to yourself is a sign of insanity only if you are surprised by the answer you get!
[13:12:52] gehel: then yes scap pull on mwdebug1001 will deploy it on that machine
[13:13:03] note there are two branches!
[13:13:07] there is
[13:13:12] well whatever my english is gone
[13:13:28] down the drain ?
[13:13:29] hashar: yes, I pulled both branches on tin
[13:13:29] :P
[13:13:49] gehel: then scap pull will pull all of that on the host
[13:13:50] actually you two could talk in french in a PM ;-)
[13:14:04] hmm
[13:14:14] pretty sure I am chatting in english with everyone :D
[13:14:32] debt: you can test on mwdebug1001
[13:15:16] elukey: in theory I already truncated that log and turned down log levels… did the drive fill up again?
[13:16:00] andrewbogott: it seems spamming due to the absence of a table.. the root is pretty tiny (10G IIRC) so it fills up quickly
[13:16:38] elukey: oh, I guess if it's ERROR then log levels won't help
[13:21:20] PROBLEM - SSH on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:21:21] PROBLEM - Check systemd state on mw1297 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:21:30] PROBLEM - salt-minion processes on mw1297 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:21:30] PROBLEM - nutcracker process on mw1297 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:21:30] PROBLEM - configured eth on mw1297 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:21:31] PROBLEM - MD RAID on mw1297 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:21:40] PROBLEM - puppet last run on mw1297 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:21:40] PROBLEM - Disk space on mw1297 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:21:40] PROBLEM - nutcracker port on mw1297 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:21:51] PROBLEM - DPKG on mw1297 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:21:52] elukey: didn't come up in time? ^^^^
[13:22:00] PROBLEM - HHVM processes on mw1297 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:22:00] PROBLEM - dhclient process on mw1297 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:22:00] PROBLEM - HHVM rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:22:00] PROBLEM - Nginx local proxy to apache on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:22:00] PROBLEM - Apache HTTP on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:22:00] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - rendering-https_443 - Could not depool server mw1296.eqiad.wmnet because of too many down!
[13:22:10] PROBLEM - Check whether ferm is active by checking the default input chain on mw1297 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:22:10] PROBLEM - Check size of conntrack table on mw1297 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:22:28] * elukey cries in a corner
[13:22:30] RECOVERY - nutcracker port on mw1297 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[13:22:30] RECOVERY - Disk space on mw1297 is OK: DISK OK
[13:22:37] sorry people
[13:22:40] RECOVERY - DPKG on mw1297 is OK: All packages OK
[13:22:50] RECOVERY - HHVM processes on mw1297 is OK: PROCS OK: 11 processes with command name hhvm
[13:22:50] RECOVERY - dhclient process on mw1297 is OK: PROCS OK: 0 processes with command name dhclient
[13:22:51] RECOVERY - Apache HTTP on mw1297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.107 second response time
[13:22:51] RECOVERY - Nginx local proxy to apache on mw1297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.110 second response time
[13:22:51] RECOVERY - HHVM rendering on mw1297 is OK: HTTP OK: HTTP/1.1 200 OK - 79011 bytes in 3.050 second response time
[13:23:00] RECOVERY - Check whether ferm is active by checking the default input chain on mw1297 is OK: OK ferm input default policy is set
[13:23:00] RECOVERY - Check size of conntrack table on mw1297 is OK: OK: nf_conntrack is 0 % full
[13:23:10] RECOVERY - SSH on mw1297 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[13:23:20] RECOVERY - Check systemd state on mw1297 is OK: OK - running: The system is fully operational
[13:23:21] RECOVERY - nutcracker process on mw1297 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[13:23:21] RECOVERY - configured eth on mw1297 is OK: OK - interfaces up
[13:23:21] RECOVERY - salt-minion processes on mw1297 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:23:21] RECOVERY - MD RAID on mw1297 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[13:23:30] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures
[13:23:31] the downtime expired a minute before the hosts were up
[13:24:14] hashar: ok, seems that this patch does not work as expected, rolling back...
[13:24:25] elukey: I don't care (too much ;) ) for the spam, just for Pybal "rendering-https_443 - Could not depool server mw1296.eqiad.wmnet because of too many down!"
[13:24:55] hashar: I'll just reset HEAD^ on tin, and then prepare a revert to both those branches
[13:25:06] (PS1) Alexandros Kosiaris: Renumber rutherfordium.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/361444
[13:25:40] gehel: :(
[13:25:40] volans: there were 4/6 imagescalers up, so I made pybal angry
[13:26:05] the important part is that there were enough to serve traffic ;)
[13:27:00] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy
[13:27:14] yes yes
[13:27:33] I check https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X periodically
[13:27:46] good :)
[13:27:56] but it is taking ages even to run puppet on them
[13:33:30] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2004 is CRITICAL: Return code of 255 is out of bounds
[13:33:40] PROBLEM - Elasticsearch HTTPS on elastic2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[13:33:40] PROBLEM - Disk space on elastic2004 is CRITICAL: Return code of 255 is out of bounds
[13:33:40] PROBLEM - Check size of conntrack table on elastic2004 is CRITICAL: Return code of 255 is out of bounds
[13:33:50] PROBLEM - DPKG on elastic2004 is CRITICAL: Return code of 255 is out of bounds
[13:33:50] PROBLEM - MD RAID on elastic2004 is CRITICAL: Return code of 255 is out of bounds
[13:33:55] lol, gehel time for slow reboot :D
[13:34:00] PROBLEM - dhclient process on elastic2004 is CRITICAL: Return code of 255 is out of bounds
[13:34:00] PROBLEM - salt-minion processes on elastic2004 is CRITICAL: Return code of 255 is out of bounds
[13:34:10] PROBLEM - SSH on elastic2004 is CRITICAL: connect to address 10.192.0.133 and port 22: Connection refused
[13:34:11] PROBLEM - puppet last run on elastic2004 is CRITICAL: Return code of 255 is out of bounds
[13:34:20] PROBLEM - configured eth on elastic2004 is CRITICAL: Return code of 255 is out of bounds
[13:34:20] PROBLEM - Check systemd state on elastic2004 is CRITICAL: Return code of 255 is out of bounds
[13:34:22] volans: damn, just as I was busy...
[13:37:30] PROBLEM - Host elastic2004 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:10] RECOVERY - salt-minion processes on elastic2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:38:10] RECOVERY - dhclient process on elastic2004 is OK: PROCS OK: 0 processes with command name dhclient
[13:38:10] RECOVERY - SSH on elastic2004 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[13:38:11] RECOVERY - puppet last run on elastic2004 is OK: OK: Puppet is currently enabled, last run 46 minutes ago with 0 failures
[13:38:20] RECOVERY - configured eth on elastic2004 is OK: OK - interfaces up
[13:38:20] RECOVERY - Check systemd state on elastic2004 is OK: OK - running: The system is fully operational
[13:38:20] RECOVERY - Host elastic2004 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms
[13:38:30] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2004 is OK: OK ferm input default policy is set
[13:38:40] RECOVERY - Disk space on elastic2004 is OK: DISK OK
[13:38:40] RECOVERY - Check size of conntrack table on elastic2004 is OK: OK: nf_conntrack is 0 % full
[13:38:41] (PS3) Andrew Bogott: proxyleaks: Avoid some edge cases that caused occasional script failure [puppet] - https://gerrit.wikimedia.org/r/360994
[13:38:51] RECOVERY - DPKG on elastic2004 is OK: All packages OK
[13:38:51] RECOVERY - MD RAID on elastic2004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[13:39:00] RECOVERY - Elasticsearch HTTPS on elastic2004 is OK: SSL OK - Certificate elastic2004.codfw.wmnet valid until 2022-02-07 08:46:57 +0000 (expires in 1686 days)
[13:39:39] (03PS3) 10Filippo Giunchedi: swift: introduce container-reconciler [puppet] - 10https://gerrit.wikimedia.org/r/356198 (https://phabricator.wikimedia.org/T151648) [13:39:41] (03PS7) 10Filippo Giunchedi: swift: introduce storage policies [puppet] - 10https://gerrit.wikimedia.org/r/353878 (https://phabricator.wikimedia.org/T151648) [13:40:58] (03PS1) 10Alexandros Kosiaris: Set debug_level on icinga [puppet] - 10https://gerrit.wikimedia.org/r/361450 (https://phabricator.wikimedia.org/T164206) [13:42:00] (03CR) 10Andrew Bogott: [C: 032] proxyleaks: Avoid some edge cases that caused occasional script failure [puppet] - 10https://gerrit.wikimedia.org/r/360994 (owner: 10Andrew Bogott) [13:42:53] (03PS3) 10Andrew Bogott: wmf_sink: Clean up DNS for cleaned up proxies on instance deletion. [puppet] - 10https://gerrit.wikimedia.org/r/360995 (https://phabricator.wikimedia.org/T168313) [13:43:14] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/361450 (https://phabricator.wikimedia.org/T164206) (owner: 10Alexandros Kosiaris) [13:43:20] (03PS1) 10Marostegui: db-eqiad.php: Add comments to db1033 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361452 (https://phabricator.wikimedia.org/T166208) [13:44:29] gehel, hashar, can I deploy config? [13:44:50] marostegui: yep, I just finished cleaning up [13:44:59] gehel: thanks :) [13:45:05] (03CR) 10Andrew Bogott: [C: 032] wmf_sink: Clean up DNS for cleaned up proxies on instance deletion. 
[puppet] - 10https://gerrit.wikimedia.org/r/360995 (https://phabricator.wikimedia.org/T168313) (owner: 10Andrew Bogott) [13:46:05] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Add comments to db1033 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361452 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [13:46:19] (03CR) 10Jcrespo: [C: 031] Set debug_level on icinga [puppet] - 10https://gerrit.wikimedia.org/r/361450 (https://phabricator.wikimedia.org/T164206) (owner: 10Alexandros Kosiaris) [13:47:22] (03Merged) 10jenkins-bot: db-eqiad.php: Add comments to db1033 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361452 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [13:47:29] (03CR) 10jenkins-bot: db-eqiad.php: Add comments to db1033 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361452 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [13:47:42] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264#3378911 (10fgiunchedi) [13:48:27] (03PS4) 10Filippo Giunchedi: swift: introduce container-reconciler [puppet] - 10https://gerrit.wikimedia.org/r/356198 (https://phabricator.wikimedia.org/T151648) [13:48:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add comments to db1033 status - T166208 (duration: 00m 48s) [13:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:39] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208 [13:49:15] (03PS3) 10Andrew Bogott: wmf_sink: Forward some changes to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/360996 [13:49:56] !log Deploy alter table s7 - db1033 - T166208 [13:49:58] (03CR) 10Alexandros Kosiaris: [C: 032] Renumber rutherfordium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/361444 (owner: 10Alexandros Kosiaris) [13:50:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:46] (03CR) 10Filippo Giunchedi: [C: 032] swift: introduce container-reconciler [puppet] - 10https://gerrit.wikimedia.org/r/356198 (https://phabricator.wikimedia.org/T151648) (owner: 10Filippo Giunchedi) [13:52:14] (03CR) 10Alexandros Kosiaris: [C: 032] Set debug_level on icinga [puppet] - 10https://gerrit.wikimedia.org/r/361450 (https://phabricator.wikimedia.org/T164206) (owner: 10Alexandros Kosiaris) [13:52:18] (03PS2) 10Alexandros Kosiaris: Set debug_level on icinga [puppet] - 10https://gerrit.wikimedia.org/r/361450 (https://phabricator.wikimedia.org/T164206) [13:52:20] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Set debug_level on icinga [puppet] - 10https://gerrit.wikimedia.org/r/361450 (https://phabricator.wikimedia.org/T164206) (owner: 10Alexandros Kosiaris) [13:54:56] would anyone please be able to merge a contint patch for me please? I am moving some packages between classes https://gerrit.wikimedia.org/r/#/c/342635/ that is to have them available on nodepool instances :] [13:57:26] (03CR) 10Andrew Bogott: [C: 032] wmf_sink: Forward some changes to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/360996 (owner: 10Andrew Bogott) [13:57:34] (03PS4) 10Andrew Bogott: wmf_sink: Forward some changes to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/360996 [13:57:59] (03PS1) 10Filippo Giunchedi: hieradata: add thumbor100[34] to thumbor nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/361454 [13:58:21] (03PS2) 10Jforrester: Enable mobile non-JavaScript editing on all MobileFrontend wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349274 (https://phabricator.wikimedia.org/T125174) [13:58:23] (03PS1) 10Jforrester: Enable mobile non-JavaScript editing on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361455 (https://phabricator.wikimedia.org/T125174) [13:59:03] (03CR) 10Jforrester: [C: 04-1] "Planned for 5 July." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/361455 (https://phabricator.wikimedia.org/T125174) (owner: 10Jforrester) [14:09:31] (03PS1) 10Jcrespo: mariadb: Add basedir support and change default socket location [puppet] - 10https://gerrit.wikimedia.org/r/361456 (https://phabricator.wikimedia.org/T148507) [14:13:55] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [14:16:01] cmjohnson1: the new labvirts (https://phabricator.wikimedia.org/T165531) are waiting on netops? Or do they still need some cables plugged? [14:27:08] i got an email saying that my membership to the Ops mailing list was disabled due to excessive bounces, anyone know what that is about? [14:29:31] <_joe_> urandom: yes, re-subscribe [14:31:11] (03PS1) 10Alexandros Kosiaris: Renumber releases1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/361461 [14:40:35] PROBLEM - Disk space on bast3002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:41:26] RECOVERY - Disk space on bast3002 is OK: DISK OK [14:46:24] andrewbogott: eth1 is cabled...i just need to do the switch cfg [14:46:33] not sure which vlan they go under [14:47:35] !log Deploy alter table on silver and labtestweb2001 - T168661 [14:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:44] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [15:06:17] (03PS2) 10Jcrespo: mariadb: refactor option support and move it to hiera [puppet] - 10https://gerrit.wikimedia.org/r/361456 (https://phabricator.wikimedia.org/T148507) [15:06:37] (03CR) 10Jcrespo: [C: 04-2] mariadb: refactor option support and move it to hiera [puppet] - 10https://gerrit.wikimedia.org/r/361456 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [15:08:38] (03CR) 10jerkins-bot: [V: 04-1] mariadb: refactor option support and move it to hiera [puppet] - 10https://gerrit.wikimedia.org/r/361456 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [15:09:49] cmjohnson1: I don't think I know any more than what's on the ticket already. 
Same as the other labvirts, however they are :/ [15:09:55] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [15:10:00] okay...thx [15:10:55] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [15:26:56] <_joe_> win 35 [15:27:11] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361470 [15:27:14] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361470 [15:29:21] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361470 (owner: 10Marostegui) [15:32:48] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361470 (owner: 10Marostegui) [15:32:50] (03PS3) 10Jforrester: Enable mobile non-JavaScript editing on all MobileFrontend wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349274 (https://phabricator.wikimedia.org/T125174) [15:33:01] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361470 (owner: 10Marostegui) [15:33:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1086 - T166208 (duration: 00m 46s) [15:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:00] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208 [15:34:37] (03PS1) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361472 (https://phabricator.wikimedia.org/T166208) [15:36:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361472 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [15:37:02] 
(03Merged) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361472 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [15:37:11] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361472 (https://phabricator.wikimedia.org/T166208) (owner: 10Marostegui) [15:37:47] 10Operations, 10Analytics, 10Traffic: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3379313 (10Nuria) [15:38:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1079 - T166208 (duration: 00m 46s) [15:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:19] (03PS3) 10Gehel: git::clone - ensure => latest should also work with non default branch [puppet] - 10https://gerrit.wikimedia.org/r/360685 [15:41:10] !log Deploy alter table s7 - db1079 - T166208 [15:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:20] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208 [15:42:36] (03CR) 10Gehel: [C: 032] git::clone - ensure => latest should also work with non default branch [puppet] - 10https://gerrit.wikimedia.org/r/360685 (owner: 10Gehel) [15:43:16] (03PS1) 10Eevans: Update collector version to 4.0.1 [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/361475 (https://phabricator.wikimedia.org/T164274) [15:44:52] (03CR) 10Eevans: [V: 032 C: 032] Update collector version to 4.0.1 [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/361475 (https://phabricator.wikimedia.org/T164274) (owner: 10Eevans) [15:48:23] (03PS5) 10Gehel: Add info to Discovery Dashboards index page [puppet] - 10https://gerrit.wikimedia.org/r/360592 (https://phabricator.wikimedia.org/T167930) (owner: 10Bearloga) [15:51:20] (03CR) 10Gehel: [C: 032] Add 
info to Discovery Dashboards index page [puppet] - 10https://gerrit.wikimedia.org/r/360592 (https://phabricator.wikimedia.org/T167930) (owner: 10Bearloga) [15:53:43] 10Operations, 10ops-codfw, 10Performance-Team, 10Thumbor, 10User-fgiunchedi: Rename mw2148 / mw2149 / mw2259 / mw2260 to thumbor200[1234] - https://phabricator.wikimedia.org/T168881#3379407 (10fgiunchedi) [15:55:13] (03PS3) 10Jcrespo: mariadb: refactor option support and move it to hiera [puppet] - 10https://gerrit.wikimedia.org/r/361456 (https://phabricator.wikimedia.org/T148507) [15:55:58] (03PS1) 10Eevans: Enable/deploy cassandra-metrics-collector 4.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/361478 (https://phabricator.wikimedia.org/T164274) [15:57:32] (03CR) 10jerkins-bot: [V: 04-1] mariadb: refactor option support and move it to hiera [puppet] - 10https://gerrit.wikimedia.org/r/361456 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [15:57:35] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:01:29] (03CR) 10Eevans: [C: 031] "This enables cmcd version 4.0.1, which fixes one more Cassandra 3.x-specific nit (namely https://phabricator.wikimedia.org/T164274). I've" [puppet] - 10https://gerrit.wikimedia.org/r/361478 (https://phabricator.wikimedia.org/T164274) (owner: 10Eevans) [16:02:16] (03PS2) 10Eevans: Enable/deploy cassandra-metrics-collector 4.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/361478 (https://phabricator.wikimedia.org/T164274) [16:03:09] (03PS1) 10Gilles: Serve thumbnails for all public wikis with Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/361479 (https://phabricator.wikimedia.org/T167796) [16:10:33] urandom: ready for 361478? [16:10:52] elukey: sure! [16:11:11] (03CR) 10Elukey: [C: 032] Enable/deploy cassandra-metrics-collector 4.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/361478 (https://phabricator.wikimedia.org/T164274) (owner: 10Eevans) [16:11:59] urandom: merged! [16:12:05] elukey: thank you! 
[16:18:19] 10Operations, 10Traffic, 10netops, 10User-Joe: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462#3379576 (10elukey) Just to be sure I'll shutdown kafka on kafka2001 before https://racktables.wikimedia.org/index.php?page=rack&rack_id=2207, please ping me 5/10 mins before the rack :) [16:25:58] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3379612 (10RobH) Chatted with Alex. These need to be jessie. Additionally, the codfw wtp systems have a rootdelay=5 installed in them. Alex mentioned it used to be in the installer, and... [16:26:15] 10Operations, 10ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3379616 (10RobH) [16:27:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816#3379620 (10Gehel) [16:29:29] (03CR) 10EBernhardson: [C: 031] "Makes sense, logs should only be created by the dump script" [puppet] - 10https://gerrit.wikimedia.org/r/361433 (https://phabricator.wikimedia.org/T162688) (owner: 10ArielGlenn) [16:30:56] 10Operations, 10ops-codfw: Troubleshoot scb2005 NICs - https://phabricator.wikimedia.org/T167763#3379631 (10Papaul) @Marostegui can you please let me know when is the best day to take this system down for troubleshooting with Dell? Thanks. [16:38:25] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [16:40:21] halfak, https://github.com/wiki-ai/revscoring/pull/327 isn't merged yet. FYI in case you missed that. 
[16:40:45] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 345 MB (3% inode=75%) [16:42:25] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [16:43:30] 10Operations, 10Traffic, 10netops, 10User-Joe: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462#3379729 (10ayounsi) [16:51:59] (03PS2) 10ArielGlenn: for cirrusdumps, do not create new logs after old ones are rotated [puppet] - 10https://gerrit.wikimedia.org/r/361433 (https://phabricator.wikimedia.org/T162688) [16:52:36] ehm andrewbogott just truncated neutron logs again :) [16:52:45] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [16:53:05] elukey: can you just do 'service stop neutron-server' while you're in there? And log it, and I'll nudge chase when he gets back from his trip. [16:53:37] andrewbogott: will the next puppet run bring it up again? [16:53:48] (03CR) 10ArielGlenn: [C: 032] for cirrusdumps, do not create new logs after old ones are rotated [puppet] - 10https://gerrit.wikimedia.org/r/361433 (https://phabricator.wikimedia.org/T162688) (owner: 10ArielGlenn) [16:53:51] I don't know. That's a test setup, maybe isn't puppetized yet [16:54:03] * elukey tries [16:54:34] seems that puppet didn't bring it back [16:55:03] !log stop neutron-server on labtestnet2001 to avoid the root partition to fill up [16:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:43] elukey: thank you! [16:55:52] yw! 
[16:59:18] !log EXPERIMENT - T163337 - set slaveof no one on rdb2004 to remove its dependency to rdb2003 (puppet disabled on rdb2004, to rollback just enable/run it) [16:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:28] T163337: Job queue corruption after codfw switch over (Queue growth, duplicate runs) - https://phabricator.wikimedia.org/T163337 [16:59:41] (03PS1) 10Alexandros Kosiaris: Remove the etherpad type from export [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/361487 (https://phabricator.wikimedia.org/T168485) [17:00:04] gehel: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170626T1700). [17:02:14] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3379777 (10RobH) [17:02:16] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3379794 (10RobH) [17:02:19] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3379811 (10RobH) [17:02:22] 10Operations, 10Labs, 10Labs-Infrastructure, 10procurement: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3379828 (10RobH) [17:02:23] actually my experiment does not need to stop puppet [17:02:51] mmmm but I need to prevent the daily reboot to bring back rdb2004's settings [17:03:02] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3379859 (10RobH) a:05RobH>03chasemp @Chasemp: Please review my racking and vlan/IP proposal above and confirm or correct. Once that is done, please ass... 
[17:03:14] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3379861 (10RobH) a:05RobH>03chasemp @Chasemp: Please review my racking and vlan/IP proposal above and confirm or correct. Once that is done, pleas... [17:03:17] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3379863 (10RobH) a:05RobH>03chasemp @Chasemp: Please review my racking and vlan/IP proposal above and confirm or correct. Once that is done, pleas... [17:03:31] 10Operations, 10Labs, 10Labs-Infrastructure, 10procurement: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3379828 (10RobH) a:05RobH>03chasemp @Chasemp: Please review my racking and vlan/IP proposal above and confirm or correct. Once that is done, plea... [17:04:23] 10Operations, 10hardware-requests: codfw: (2) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#3379867 (10RobH) 05Open>03Resolved These two hosts have been ordered and received in, and are being setup via T168891 and T168892. [17:04:27] 10Operations, 10Labs, 10hardware-requests: Codfw: (2) hardware access request for labtest [region 2] - https://phabricator.wikimedia.org/T161766#3379872 (10RobH) 05Open>03Resolved These two hosts are on site and being setup via tasks T168893 and T168894. 
[17:09:16] !log gehel@tin Started deploy [wdqs/wdqs@f8b9294]: (no justification provided) [17:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:58] !log gehel@tin Finished deploy [wdqs/wdqs@f8b9294]: (no justification provided) (duration: 03m 42s) [17:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:35] SMalyshev: wdqs deployment completed, tests are green and I see updates flowing in the logs [17:16:13] (03PS2) 10Dzahn: jenkins: lower console log spam [puppet] - 10https://gerrit.wikimedia.org/r/359116 (owner: 10Hashar) [17:30:46] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Green Cardamom → GreenC: supervision needed - https://phabricator.wikimedia.org/T168776#3379913 (10alanajjar) @Marostegui can we start now? [17:35:07] jouncebot: refresh [17:35:10] I refreshed my knowledge about deployments. [17:35:13] jouncebot: now [17:35:13] For the next 0 hour(s) and 24 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170626T1730) [17:35:50] !log resuming the train for wmf.6 which was blocked at group 1 [17:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:19] (03PS4) 10Framawiki: Limit FeaturedFeed on dewiki to last seven days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664) (owner: 10BearND) [17:36:29] !log Deploying 1.30.0-wmf.6 to all wikis refs T167535 [17:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:37] T167535: MW-1.30.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T167535 [17:37:29] (03PS1) 1020after4: all wikis to 1.30.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361491 [17:37:31] (03CR) 1020after4: [C: 032] all wikis to 1.30.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361491 (owner: 1020after4) [17:38:58] (03Merged) 10jenkins-bot: all 
wikis to 1.30.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361491 (owner: 1020after4) [17:39:07] (03CR) 10jenkins-bot: all wikis to 1.30.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361491 (owner: 1020after4) [17:41:21] (03PS5) 10BearND: Limit FeaturedFeed on dewiki to last seven days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664) [17:43:19] (03CR) 10Dzahn: [C: 032] jenkins: lower console log spam [puppet] - 10https://gerrit.wikimedia.org/r/359116 (owner: 10Hashar) [17:46:52] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.30.0-wmf.6 [17:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:01] (03CR) 10Framawiki: [C: 04-1] "Since the problem only concern one feed, the featured one, I think that it's better to use wmgFeaturedFeedsOverrides." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664) (owner: 10BearND) [17:47:49] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: Revisit paging strategy for frack servers - https://phabricator.wikimedia.org/T163368#3379980 (10Jgreen) 05Open>03Resolved a:03Jgreen I think we resolved the core issues of this task, thus closing it. [17:47:52] (03CR) 10Framawiki: [C: 04-1] "And why do you want to edit "incubatorwiki"'s line ?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664) (owner: 10BearND) [17:50:19] (03CR) 10Dzahn: Nrpe: Fix check_ram script to work on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361361 (owner: 10Paladox) [17:50:46] (03PS4) 10Paladox: Nrpe: Fix check_ram script to work on stretch [puppet] - 10https://gerrit.wikimedia.org/r/361361 [17:51:06] (03CR) 10Paladox: Nrpe: Fix check_ram script to work on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361361 (owner: 10Paladox) [17:51:48] arlolra will be doing a parsoid deploy (early) so that mobrovac can do a restbase deploy after without having to stick around very late in his timezone. [17:51:49] (03CR) 10Dzahn: "could you paste the error you were getting please" [puppet] - 10https://gerrit.wikimedia.org/r/361361 (owner: 10Paladox) [17:52:03] just a heads up about that. [17:52:05] (03CR) 10Paladox: "I was getting UNKOWN." [puppet] - 10https://gerrit.wikimedia.org/r/361361 (owner: 10Paladox) [17:52:16] since this is 2 hours before the services deployment window. [17:52:34] (03CR) 10Paladox: "There was no specific error except from it doing UNKOWN when it had b in there. doing just 20 25 worked." [puppet] - 10https://gerrit.wikimedia.org/r/361361 (owner: 10Paladox) [17:58:08] (03CR) 10Dzahn: Nrpe: Fix check_ram script to work on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361361 (owner: 10Paladox) [17:59:42] (03CR) 10BearND: "@Framawiki Sorry about that earlier PS. I was just rebasing it manually since Gerrit said it couldn't merge it. I did not want to edit the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664) (owner: 10BearND) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. 
Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170626T1800). [18:00:04] framawiki: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:11] Me! I'll SWAT. [18:00:49] o/ [18:01:14] framawiki: I believe this has consensus for deployment on huwikibooks, right? [18:01:45] yes :) [18:01:48] https://hu.wikibooks.org/w/index.php?title=Wikik%C3%B6nyvek:T%C3%A1rsalg%C3%B3&oldid=288515#A_kv.C3.ADz_kiterjeszt.C3.A9s_telep.C3.ADt.C3.A9se [18:02:35] (03PS2) 10Niharika29: Enable Quiz extension on huwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361084 (https://phabricator.wikimedia.org/T168471) (owner: 10Framawiki) [18:02:43] (03CR) 10Niharika29: [C: 032] Enable Quiz extension on huwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361084 (https://phabricator.wikimedia.org/T168471) (owner: 10Framawiki) [18:03:43] (03Merged) 10jenkins-bot: Enable Quiz extension on huwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361084 (https://phabricator.wikimedia.org/T168471) (owner: 10Framawiki) [18:04:05] (03CR) 10Framawiki: [C: 04-1] "Ok, no problem :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664) (owner: 10BearND) [18:06:05] (03CR) 10jenkins-bot: Enable Quiz extension on huwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361084 (https://phabricator.wikimedia.org/T168471) (owner: 10Framawiki) [18:09:30] framawiki: It's on mwdebug1002 if you wanna test? 
[18:10:51] Nikerabbit: confirmed on Special:version [18:11:02] (03PS6) 10BearND: Limit FeaturedFeed on dewiki to last seven days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664) [18:11:51] !log niharika29@tin Started scap: wmf-config/InitialiseSettings.php Deploy Quiz extension on huwikibooks (https://gerrit.wikimedia.org/r/#/c/361084) [18:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:08] !log niharika29@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [18:14:08] !log niharika29@tin scap failed: RuntimeError scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) (duration: 02m 15s) [18:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:21] (03PS7) 10BearND: Limit FeaturedFeed on dewiki to last seven days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664) [18:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:27] !log niharika29@tin Started scap: wmf-config/InitialiseSettings.php Deploy Quiz extension on huwikibooks (https://gerrit.wikimedia.org/r/#/c/361084) [18:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:41] !log niharika29@tin Finished scap: wmf-config/InitialiseSettings.php Deploy Quiz extension on huwikibooks (https://gerrit.wikimedia.org/r/#/c/361084) (duration: 03m 14s) [18:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:58] framawiki: All done. [18:19:22] confirmed on prod, thanks Niharika [18:19:41] framawiki: Thanks. [18:19:52] SWAT over. 
[18:20:46] !log arlolra@tin Started deploy [parsoid/deploy@70538a6]: Updating Parsoid to b59045f2 [18:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:45] (03PS1) 10Ppchelko: Expose mediawiki.revision-create stream from eventstreams. [puppet] - 10https://gerrit.wikimedia.org/r/361497 (https://phabricator.wikimedia.org/T167670) [18:30:59] (03CR) 10BearND: "@Framawiki I've removed the incubatewiki line edit and moved my change over to the override block. I hope that the structure is correct." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664) (owner: 10BearND) [18:31:21] (03PS8) 10BearND: Limit FeaturedFeed on dewiki to last seven days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664) [18:31:59] !log arlolra@tin Finished deploy [parsoid/deploy@70538a6]: Updating Parsoid to b59045f2 (duration: 11m 13s) [18:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:48] !log T160570: Upgrading restbase-dev1001 to Cassandra 3.11.0 (release) [18:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:59] T160570: Cassandra 3.x Tracking - https://phabricator.wikimedia.org/T160570 [18:35:38] 10Operations, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3380110 (10Dzahn) summary of the issue i'm seeing. The context is "rancid". What it does is ssh to all the switches, download config, compare to local git repo, if there are diffs then send email abou... [18:41:27] 10Operations, 10Traffic, 10Patch-For-Review: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3380116 (10ema) Just a few (partial) answers so far, but here we go! >>! In T164768#3374941, @BBlack wrote: > 1) cache_misc still has a `do_stream = false` case on the backend... 
[18:49:26] (03CR) 10Mobrovac: [C: 031] "We should schedule this for PuppetSWAT" [puppet] - 10https://gerrit.wikimedia.org/r/361497 (https://phabricator.wikimedia.org/T167670) (owner: 10Ppchelko) [18:50:00] 10Operations, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3380121 (10Dzahn) `strace -f` with one of those ends in: ``` [pid 24567] close(1) = 0 [pid 24567] fstat(1, 0x7fff45095130) = -1 EBADF (Bad file descriptor) [pid 24567] exit_grou... [18:51:55] !log Updated Parsoid to b59045f2 (T39902, T149794) [18:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:05] T39902: RFC: Implement rendering of redlinks in Parsoid HTML as post-processor - https://phabricator.wikimedia.org/T39902 [18:52:06] T149794: Mark disambiguation links in Parsoid - https://phabricator.wikimedia.org/T149794 [18:52:58] (03PS1) 10Herron: Change lists.wikimedia.org SPF record to soft fail (~all) [dns] - 10https://gerrit.wikimedia.org/r/361501 (https://phabricator.wikimedia.org/T167703) [18:59:17] !log mobrovac@tin Started deploy [restbase/deploy@3975ab2]: Update Parsoid HTML version to 1.5.0 - T39902 [18:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:27] T39902: RFC: Implement rendering of redlinks in Parsoid HTML as post-processor - https://phabricator.wikimedia.org/T39902 [19:05:33] !log mobrovac@tin Finished deploy [restbase/deploy@3975ab2]: Update Parsoid HTML version to 1.5.0 - T39902 (duration: 06m 16s) [19:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:43] T39902: RFC: Implement rendering of redlinks in Parsoid HTML as post-processor - https://phabricator.wikimedia.org/T39902 [19:09:45] (03PS1) 10Andrew Bogott: Labtestpuppetmaster2001: Initial node definition [puppet] - 10https://gerrit.wikimedia.org/r/361504 [19:11:08] 10Operations, 10Commons, 10Traffic, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to 
change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3380185 (BBlack) This task has gotten a bit confusing. Stepping back a bit from the specific case of Commons (be... [19:14:25] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [19:14:35] PROBLEM - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.36 and port 9042: Connection refused [19:14:51] got it ^^^ [19:14:55] PROBLEM - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:16:55] PROBLEM - cassandra-a service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [19:17:05] PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:18:20] Operations, Wikimedia-General-or-Unknown, I18n: wikimediafoundation.org's language selector is confusing to most visitors who don't have accounts there - https://phabricator.wikimedia.org/T166782#3307409 (BBlack) >>! In T166782#3307457, @Amire80 wrote: > I guess that the traffic to wikimediafoundatio...
[19:18:55] RECOVERY - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is OK: SSL OK - Certificate restbase-dev1001-a valid until 2018-01-05 22:53:02 +0000 (expires in 193 days) [19:18:56] RECOVERY - cassandra-a service on restbase-dev1001 is OK: OK - cassandra-a is active [19:19:05] RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational [19:19:35] RECOVERY - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is OK: TCP OK - 0.000 second response time on 10.64.0.36 port 9042 [19:23:07] (PS1) Eevans: Limit mmap to indexes (work-around for abnormal page faults) [puppet] - https://gerrit.wikimedia.org/r/361506 (https://phabricator.wikimedia.org/T137419) [19:24:55] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [19:29:44] (CR) Eevans: [C: 1] "This is a similar workaround to what we already have in place on Cassandra 2.2 (I'm sure it's literally the same bug). This specific chan" [puppet] - https://gerrit.wikimedia.org/r/361506 (https://phabricator.wikimedia.org/T137419) (owner: Eevans) [19:30:54] !log T160570: Upgrading restbase-dev1002 to Cassandra 3.11.0 (release) [19:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:04] T160570: Cassandra 3.x Tracking - https://phabricator.wikimedia.org/T160570 [19:35:15] !log T160570: Upgrading restbase-dev1003 to Cassandra 3.11.0 (release) [19:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:00] Operations, ArchCom-RfC, Traffic, Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3349120 (Anomie) It'd be nice if you define "API": are you talking about res...
[19:41:57] !log Restarted Jenkins to lower console log spam ( https://gerrit.wikimedia.org/r/#/c/359116/ ) [19:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:46] Operations, Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3380328 (Dzahn) Issue has been found. Permissions in git bare repo and working dir. some files not owned by rancid:rancid as they should. Works now. thanks Apergos for seeing it. It looks all good now. [19:49:41] (PS1) Dzahn: netmon: remove rancid role from netmon1001 [puppet] - https://gerrit.wikimedia.org/r/361510 [19:49:55] PROBLEM - Check size of conntrack table on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:49:56] PROBLEM - puppet last run on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:50:06] PROBLEM - dhclient process on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:50:15] PROBLEM - salt-minion processes on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:50:15] PROBLEM - Check systemd state on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:50:16] PROBLEM - configured eth on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:50:25] PROBLEM - DPKG on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:50:26] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:50:35] PROBLEM - Disk space on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:50:36] PROBLEM - Hadoop NodeManager on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:50:37] PROBLEM - Disk space on Hadoop worker on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:50:39] PROBLEM - Hadoop DataNode on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:52:16] (PS2) Dzahn: netmon: remove rancid role from netmon1001 [puppet] - https://gerrit.wikimedia.org/r/361510 [19:52:36] PROBLEM - SSH on analytics1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:41] checking analytics1050 [19:54:35] thanks! i was about to open mgmt [19:54:35] PROBLEM - YARN NodeManager Node-State on analytics1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:54:59] (CR) Dzahn: [C: 2] netmon: remove rancid role from netmon1001 [puppet] - https://gerrit.wikimedia.org/r/361510 (owner: Dzahn) [19:55:16] RECOVERY - configured eth on analytics1050 is OK: OK - interfaces up [19:55:25] RECOVERY - Check systemd state on analytics1050 is OK: OK - running: The system is fully operational [19:55:25] RECOVERY - DPKG on analytics1050 is OK: All packages OK [19:55:25] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1050 is OK: OK ferm input default policy is set [19:55:25] RECOVERY - YARN NodeManager Node-State on analytics1050 is OK: OK: YARN NodeManager analytics1050.eqiad.wmnet:8041 Node-State: RUNNING [19:55:27] RECOVERY - Disk space on analytics1050 is OK: DISK OK [19:55:35] RECOVERY - SSH on analytics1050 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [19:55:35] RECOVERY - Hadoop NodeManager on analytics1050 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [19:55:36] RECOVERY - Disk space on Hadoop worker on analytics1050 is OK: DISK OK [19:55:38] RECOVERY - Hadoop DataNode on analytics1050 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [19:55:55] RECOVERY - Check size of conntrack table on analytics1050 is OK: OK: nf_conntrack is 1 % full [19:55:55] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is
currently enabled, last run 22 minutes ago with 0 failures [19:56:01] auto recovery? [19:56:05] RECOVERY - dhclient process on analytics1050 is OK: PROCS OK: 0 processes with command name dhclient [19:56:15] RECOVERY - salt-minion processes on analytics1050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:56:22] !log updated ops list accept_these_nonmembers regex (T168903) [19:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170626T2000). [20:02:39] greg-g: we are not finished building new models so we can't deploy today but it's urgent (https://phabricator.wikimedia.org/T168773) [20:03:45] greg-g: can we deploy around 2200 UTC? [20:07:05] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [20:09:36] thanks, though my name is on as clinic I admit I'm mostly checked out at this point (11 pm) [20:10:11] parsoid deploy already done. [20:26:57] (CR) Mobrovac: [C: 1] Limit mmap to indexes (work-around for abnormal page faults) [puppet] - https://gerrit.wikimedia.org/r/361506 (https://phabricator.wikimedia.org/T137419) (owner: Eevans) [20:27:32] Operations, DBA, Wikimedia-Site-requests: Global rename of Green Cardamom → GreenC: supervision needed - https://phabricator.wikimedia.org/T168776#3376531 (Luke081515) @alanajjar I guess it would make more sense to ping him at IRC, so if there is a problem, you can communicate much faster. [20:29:21] Operations, DBA, Wikimedia-Language-setup, Wikimedia-Site-requests: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3380415 (Luke081515) Adding #DBA to get an opinion from one of them.
[20:29:25] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [20:34:32] !log bsitzmann@tin Started deploy [mobileapps/deploy@07066c7]: Update mobileapps to 0b05026 [20:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:13] !log bsitzmann@tin Finished deploy [mobileapps/deploy@07066c7]: Update mobileapps to 0b05026 (duration: 03m 41s) [20:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:36] (PS1) Hashar: jenkins: lack of access_log produce invalid system unit [puppet] - https://gerrit.wikimedia.org/r/361551 [21:00:04] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170626T2100). [21:00:09] !log attempting firmware update on lvs1007, which is currently offline [21:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:41] huh, sending file via ssh tunnel to the lan, and then having it skip out onto the slower mgmt network, does not make a fast upload. [21:05:14] (Abandoned) Niharika29: Deploy and enable LoginNotify on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/353352 (owner: Niharika29) [21:10:56] Operations, ops-eqiad, Traffic, netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3323716 (RobH) So attempting to upload this over the web to the gui mgmt interface times out. It may work a bit better if done locally from eqiad. I'm pushing the file to my home...
[21:17:38] Operations, Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3380505 (Jgreen) [21:17:52] PROBLEM - are wikitech and wt-static in sync on labtestweb2001 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech [21:17:52] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech [21:22:36] Operations, Fundraising-Backlog, Technical-Debt: Determine if benefactorevents.wikimedia.org should be hosted on the production cluster or still on Microsoft Azure - https://phabricator.wikimedia.org/T166240#3380515 (Jgreen) [21:26:44] (PS2) Andrew Bogott: Labtestpuppetmaster2001: Initial node definition [puppet] - https://gerrit.wikimedia.org/r/361504 [21:26:59] Operations, Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3380529 (CCogdill_WMF) Major Gifts is discontinuing the events integration we had in place through these sites. The contract ends at the end of the month, so I'm pretty...
[21:27:32] !log deployed patch for T128209 [21:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:42] (CR) Andrew Bogott: [C: 2] Labtestpuppetmaster2001: Initial node definition [puppet] - https://gerrit.wikimedia.org/r/361504 (owner: Andrew Bogott) [21:33:32] PROBLEM - DPKG on labtestpuppetmaster2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:35:32] RECOVERY - DPKG on labtestpuppetmaster2001 is OK: All packages OK [21:47:33] PROBLEM - puppetmaster https on labtestpuppetmaster2001 is CRITICAL: connect to address 208.80.153.108 and port 8140: Connection refused [21:50:12] !log shutting down and decommissioning mw117[0-9] per T168271 [21:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:22] T168271: Decommission mw1170-mw1179 - https://phabricator.wikimedia.org/T168271 [21:51:10] (PS1) Hashar: aptrepo: fix spec [puppet] - https://gerrit.wikimedia.org/r/361577 [21:52:43] (PS1) Hashar: jenkins: fix spec [puppet] - https://gerrit.wikimedia.org/r/361578 [21:53:54] (PS1) Hashar: nrpe: fix spec (depends on stdlib) [puppet] - https://gerrit.wikimedia.org/r/361580 [21:56:00] Operations, ops-eqiad, hardware-requests, Patch-For-Review, User-Joe: Decommission mw1170-mw1179 - https://phabricator.wikimedia.org/T168271#3380614 (RobH) [21:56:05] Operations, ops-eqiad, hardware-requests, Patch-For-Review, User-Joe: Decommission mw1170-mw1179 - https://phabricator.wikimedia.org/T168271#3359900 (RobH) ge-6/0/9 up up mw1170 ge-6/0/10 up up mw1171 ge-6/0/11 up up mw1172 ge-6/0/12 up up mw1173 g...
[21:58:11] Operations, ops-eqiad, hardware-requests, User-Joe: Decommission mw1170-mw1179 - https://phabricator.wikimedia.org/T168271#3380622 (RobH) [22:00:59] (PS1) RobH: decom mw117[0-9] [puppet] - https://gerrit.wikimedia.org/r/361581 [22:03:06] (PS1) RobH: mw117[0-9] decommission [dns] - https://gerrit.wikimedia.org/r/361582 [22:03:30] (CR) RobH: [C: 2] decom mw117[0-9] [puppet] - https://gerrit.wikimedia.org/r/361581 (owner: RobH) [22:03:53] (CR) RobH: [C: 2] mw117[0-9] decommission [dns] - https://gerrit.wikimedia.org/r/361582 (owner: RobH) [22:07:29] Operations, ops-eqiad, hardware-requests, Patch-For-Review, User-Joe: Decommission mw1170-mw1179 - https://phabricator.wikimedia.org/T168271#3380657 (RobH) [22:07:35] Operations, Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3380658 (BBlack) Yeah we can close this task if the sites are gone. We'll want to remove the current IP address mapping for these hostnames from our DNS when this happ... [22:07:59] Operations, ops-eqiad, hardware-requests, Patch-For-Review, User-Joe: Decommission mw1170-mw1179 - https://phabricator.wikimedia.org/T168271#3359900 (RobH) a:Cmjohnson All non-onsite steps have been completed, and these hosts now await disk wipes. [22:14:10] Operations, Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3380675 (CCogdill_WMF) As far as I know, it is just those first two subdomains you listed. I'm not sure benefactors.wikimedia.org goes anywhere, anyway... [22:20:41] Operations, Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#2359459 (Dzahn) benefactors.wikimedia.org may not be used for HTTP(S) but it is apparently used for email?
T130937 https://phabricator.wikimedia.org/rODNSc6dc7dcb64c4... [22:21:38] (PS2) Dzahn: netmon1002: add smokeping role [puppet] - https://gerrit.wikimedia.org/r/361191 (https://phabricator.wikimedia.org/T159756) [22:22:39] (PS1) Hashar: tilerator: mock secret() [puppet] - https://gerrit.wikimedia.org/r/361586 [22:23:04] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2025590 [22:24:30] Operations, Traffic, Wikimedia-Shop, HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3380694 (BBlack) @Jseddon @Mbeat33 - ping again? The redirect appears to work currently, but still no HSTS header. [22:25:40] !log halfak@tin Started deploy [ores/deploy@82dfd56]: Unscheduled/urgent deploy (T168099) [22:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:50] T168099: Mid June 2017 ORES deployment - https://phabricator.wikimedia.org/T168099 [22:26:49] (PS1) Legoktm: Enable 'Linter' debug log channel [mediawiki-config] - https://gerrit.wikimedia.org/r/361589 [22:26:57] jouncebot: now [22:26:57] For the next 0 hour(s) and 33 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170626T2100) [22:27:00] (PS1) Hashar: servermon: fix spec [puppet] - https://gerrit.wikimedia.org/r/361590 [22:27:01] jouncebot: next [22:27:01] In 0 hour(s) and 32 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170626T2300) [22:27:36] Operations, Traffic, Wikimedia-Shop, HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3380705 (BBlack) [22:27:39] Operations, Traffic, Wikimedia-Shop, HTTPS: Canonical URL in Store points to HTTP address, should be HTTPS - https://phabricator.wikimedia.org/T131131#3380702 (BBlack) Open>Resolved a:BBlack Currently this looks to be fixed.
The relevant snippet on the live store site is now: ```... [22:27:48] !log netmon1001 - deactivate rancid crons - now running on netmon1002 instead - avoid duplicate mails (T159756) [22:27:52] Operations, Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3380706 (CCogdill_WMF) Oh, I see. I'm not entirely sure about this. @DKaufman I'm trying to identify which domains are getting phased out with the Trilogy system. We u... [22:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:58] T159756: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756 [22:28:07] (CR) jerkins-bot: [V: -1] servermon: fix spec [puppet] - https://gerrit.wikimedia.org/r/361590 (owner: Hashar) [22:29:32] (CR) Legoktm: [C: 2] Enable 'Linter' debug log channel [mediawiki-config] - https://gerrit.wikimedia.org/r/361589 (owner: Legoktm) [22:30:04] I'm going to deploy some Linter debug logging right now [22:30:35] (Merged) jenkins-bot: Enable 'Linter' debug log channel [mediawiki-config] - https://gerrit.wikimedia.org/r/361589 (owner: Legoktm) [22:30:49] (CR) jenkins-bot: Enable 'Linter' debug log channel [mediawiki-config] - https://gerrit.wikimedia.org/r/361589 (owner: Legoktm) [22:32:10] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Enable 'Linter' debug log channel (duration: 00m 44s) [22:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:35] !log legoktm@tin Synchronized php-1.30.0-wmf.6/extensions/Linter/includes/ApiRecordLint.php: Add debug logging for missing 'dsr' - T168900 (duration: 00m 43s) [22:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:45] T168900: Notice: Undefined index: dsr in /extensions/Linter/includes/ApiRecordLint.php on line 65 - https://phabricator.wikimedia.org/T168900 [22:36:53] Operations,
Ops-Access-Requests, Scoring-platform-team-Backlog: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3380731 (awight) Resolved>Open Looks like I'll need shell access to scb1002.eqiad.wmnet, in order to do canary tests while deploying.... [22:45:24] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [22:49:28] !log Updated LDAP loginShell to /bin/bash for 969 accounts that were still set to /usr/local/bin/sillyshell (T86668) [22:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:39] T86668: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668 [22:49:48] Operations, Ops-Access-Requests, Scoring-platform-team-Backlog: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3380759 (awight) Sounds like I'll need shell on scb[1-2]* and also the ores-admin group, so I can do terrible things on production boxes. [22:50:22] Operations, Ops-Access-Requests, Scoring-platform-team-Backlog: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3380762 (Halfak) [22:51:44] anomie: I'm gonna add https://gerrit.wikimedia.org/r/#/c/361508/ to SWAT [22:53:33] (PS1) Ladsgroup: Add awight to ores-admins [puppet] - https://gerrit.wikimedia.org/r/361593 (https://phabricator.wikimedia.org/T168442) [22:54:14] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [22:55:06] Operations, LDAP-Access-Requests, Labs, Labs-Infrastructure: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#3380786 (bd808) The `sillyshell` entries have been cleaned up. ``` $ ldapsearch -xLLL -E pr=40000/noprompt -b 'dc=wikimedia,dc=org' '(&(objectC...
[22:55:14] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [22:56:35] !log halfak@tin Finished deploy [ores/deploy@82dfd56]: Unscheduled/urgent deploy (T168099) (duration: 30m 55s) [22:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:45] T168099: Mid June 2017 ORES deployment - https://phabricator.wikimedia.org/T168099 [22:57:02] Operations, Ops-Access-Requests, Scoring-platform-team-Backlog, Patch-For-Review: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3380797 (Ladsgroup) This is the only thing that needs to be done [22:57:53] (CR) RobH: [C: -1] "This cannot merge without operations meeting review, as it grants sudo access." [puppet] - https://gerrit.wikimedia.org/r/361593 (https://phabricator.wikimedia.org/T168442) (owner: Ladsgroup) [22:58:47] Operations, Ops-Access-Requests, Scoring-platform-team-Backlog, Patch-For-Review: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3380801 (RobH) Open>Resolved Addition to the ores-admins is a sudo group, and thus will require review during the... [22:59:51] Operations, Ops-Access-Requests, Scoring-platform-team-Backlog, Patch-For-Review: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3380805 (RobH) Resolved>Open Also no one reopened this when requesting more rights be added, opening it back up now. [22:59:55] Operations, Traffic, Wikimedia-Stream: stream.wikimedia.org - redirect http(s) to docs - https://phabricator.wikimedia.org/T70528#3380808 (BBlack) Open>Resolved a:BBlack This has been working for some time, at least for the HTTPS issue at the root as tasked here! The other part about doc...
[23:00:17] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170626T2300). [23:00:17] foks and MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:01:02] Operations, Ops-Access-Requests, Scoring-platform-team-Backlog, Patch-For-Review: Grant AWight accounts on ores production clusters - https://phabricator.wikimedia.org/T168442#3380813 (awight) My fault, I've been flapping this task like crazy... T168442#3380731 Thanks for taking a look! [23:02:02] hey, I'm in a standup right now, will deploy my patch myself later this window [23:02:10] i'm here tho [23:02:19] Operations, Traffic: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches - https://phabricator.wikimedia.org/T168919#3380816 (BBlack) [23:02:43] won't be able to test while here - it's pushing to a submodule I'm working out how to actually access :) [23:03:12] Operations, Traffic: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches - https://phabricator.wikimedia.org/T168919#3380831 (BBlack) [23:03:23] Operations, Traffic, HTTPS, Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3380832 (BBlack) [23:06:48] Operations, Traffic, HTTPS, Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3380851 (BBlack) The original point of this (now ~2 years old) tracking task was to track the very long tail of known but relatively-minor issues preventing us from reaching...
[23:06:54] (CR) Dzahn: [C: 2] netmon1002: add smokeping role [puppet] - https://gerrit.wikimedia.org/r/361191 (https://phabricator.wikimedia.org/T159756) (owner: Dzahn) [23:07:22] (PS1) BryanDavis: nslcd: Remove Labs shell override [puppet] - https://gerrit.wikimedia.org/r/361595 (https://phabricator.wikimedia.org/T86668) [23:07:37] * twentyafterfour can deploy https://gerrit.wikimedia.org/r/#/c/361508/ [23:07:39] Operations, Traffic, HTTPS, Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3380860 (BBlack) [23:07:42] Operations, Traffic, HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#3380859 (BBlack) [23:07:52] Operations, Traffic, HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1101271 (BBlack) [23:07:55] Operations, Traffic, Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2715740 (BBlack) [23:08:06] Operations, Traffic, HTTPS, Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (BBlack) [23:08:11] Operations, Discovery, Traffic, Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2884289 (BBlack) [23:09:12] Operations, Traffic, Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#3380867 (BBlack) [23:09:15] Operations, Traffic, HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1101271 (BBlack) [23:13:03] Operations, Traffic, HTTPS, Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3380869 (BBlack) [23:23:34] !log deploying https://gerrit.wikimedia.org/r/#/c/361508 [23:23:42] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:43] (CR) MaxSem: [C: 2] Adding font files and license [mediawiki-config/fonts] - https://gerrit.wikimedia.org/r/361195 (https://phabricator.wikimedia.org/T168757) (owner: Foks) [23:24:11] (CR) MaxSem: [V: 2 C: 2] Adding font files and license [mediawiki-config/fonts] - https://gerrit.wikimedia.org/r/361195 (https://phabricator.wikimedia.org/T168757) (owner: Foks) [23:24:21] !log twentyafterfour@tin Synchronized php-1.30.0-wmf.6/extensions/Scribunto/engines/LuaSandbox/Engine.php: deploy https://gerrit.wikimedia.org/r/#/c/361508 (duration: 00m 43s) [23:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:04] Operations, Traffic, HTTPS, Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3380884 (BBlack) [23:26:43] Operations, Traffic, HTTPS, Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (BBlack) [23:27:35] Operations, Traffic, HTTPS, Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (BBlack) [23:29:52] hrm... [23:30:29] Operations, ops-codfw, hardware-requests: reclaim/decom tmh200[12] - https://phabricator.wikimedia.org/T168472#3380906 (RobH) a:RobH>Papaul [23:31:32] foks, I'm preparing a config change to update your submodule [23:31:36] Operations, ops-codfw, hardware-requests: reclaim/decom tmh200[12] - https://phabricator.wikimedia.org/T168472#3365481 (RobH) Please note these systems were never setup, are powered down, but were previously installed in Tampa. They likely need their disks wiped, but they were also never setup on th... [23:31:52] MaxSem, ok. as far as I know, it is demon's modules [23:31:54] * module [23:32:06] but by all means!
[23:32:09] Operations, ops-codfw, hardware-requests: reclaim/decom tmh200[12] - https://phabricator.wikimedia.org/T168472#3380912 (RobH) [23:32:30] it's possible that's not even the correct place to push those to [23:33:31] (PS1) MaxSem: Fonts: update submodule to https://gerrit.wikimedia.org/r/#/c/361195/ [mediawiki-config] - https://gerrit.wikimedia.org/r/361598 [23:33:56] (PS1) Dzahn: smokeping: ensure dnsutils is installed, use require_package [puppet] - https://gerrit.wikimedia.org/r/361599 [23:34:39] (CR) MaxSem: [C: 2] "Schwatt" [mediawiki-config] - https://gerrit.wikimedia.org/r/361598 (owner: MaxSem) [23:36:20] (PS2) Dzahn: smokeping: ensure dnsutils is installed, use require_package [puppet] - https://gerrit.wikimedia.org/r/361599 [23:36:30] (Merged) jenkins-bot: Fonts: update submodule to https://gerrit.wikimedia.org/r/#/c/361195/ [mediawiki-config] - https://gerrit.wikimedia.org/r/361598 (owner: MaxSem) [23:36:40] (CR) jenkins-bot: Fonts: update submodule to https://gerrit.wikimedia.org/r/#/c/361195/ [mediawiki-config] - https://gerrit.wikimedia.org/r/361598 (owner: MaxSem) [23:37:58] (CR) Dzahn: [C: 2] smokeping: ensure dnsutils is installed, use require_package [puppet] - https://gerrit.wikimedia.org/r/361599 (owner: Dzahn) [23:38:50] !log maxsem@tin Synchronized fonts/: https://gerrit.wikimedia.org/r/361195 (duration: 00m 45s) [23:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:00] foks, ^ [23:40:16] thanks! [23:40:27] i am not entirely sure what you did [23:41:48] cd mediawiki-config && git submodule update --init fonts && cd fonts && git fetch && git checkout origin/master && cd .. && git add fonts && git commit -m ...
[23:44:24] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [23:44:38] oh ym [23:44:41] * my :P [23:50:10] Operations, Traffic: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches - https://phabricator.wikimedia.org/T168919#3380934 (BBlack) @Ottomata - Any high level new info about timetables for deprecating and then removing the RCStream stuff in favor of EventStreams ( T130651... [23:50:40] (PS1) RobH: decom mw2098 [dns] - https://gerrit.wikimedia.org/r/361603 [23:51:04] (CR) RobH: [C: 2] decom mw2098 [dns] - https://gerrit.wikimedia.org/r/361603 (owner: RobH) [23:51:33] !log maxsem@tin Synchronized php-1.30.0-wmf.6/extensions/Kartographer/: https://gerrit.wikimedia.org/r/#/c/361584/ (duration: 00m 44s) [23:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:47] Operations, ops-codfw, hardware-requests, Patch-For-Review: Decomission mw2098 - https://phabricator.wikimedia.org/T164959#3380943 (RobH) Open>Resolved a:RobH>None [23:52:44] Operations, Traffic: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches - https://phabricator.wikimedia.org/T168919#3380949 (BBlack) ( Note also ori did a soft announce of HTTPS transition for it about a year ago, but with no target date for disabling plain HTTP: https://l... [23:53:04] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 9